We value your feedback and insights. If you have suggestions, ideas, or questions, feel free to comment on existing issues. Engaging in discussions helps shape Monty's direction and ensures that the community's needs and concerns are considered.
Please consider the following when commenting.
Remember the people involved in the Pull Request. Be welcoming. If there is a conflict or coming to an agreement is difficult, having an audio or video call can help. Remember to always be polite and that everyone is just trying to help in their own way.
When commenting on issues, please favor gaining an understanding of what is being communicated. What is the other person's context?
We encourage everyone to reproduce Bug Reports. If you generate a reproduction, please comment on the Issue with your reproduction to verify the problem.
If you begin working on a Feature Request, it is helpful to let people know. While Feature Requests are not assigned to specific people, in some cases, it may be beneficial to discuss them with others who are also working on them.
📘 This page is about contributing Tutorials
See here for current tutorials.
Tutorials are a great way for people to get hands-on experience with our code and to learn about our approach. They should contain a mix of working code, text explaining the code, and images visualizing the concepts explained and any results from executing the code. They should be an easy-to-follow resource for people who are new to this project. They should not require someone to have read the rest of the documentation but can link to other documentation pages for further reading.
We deeply appreciate people who take the time to write tutorials for others. Especially if you have come to this project recently, you may be the best person to explain the code to other newcomers, as you approach it with a fresh mind and remember the concepts you struggled with.
If you would like to contribute a tutorial on a specific topic, the best place to start is to have a look at our existing tutorials to get an idea of how you can structure it. Then you can create a new .md file inside the tutorials folder and write your tutorial. Please make sure to include working code so that people can just copy it into a Python file and follow along. Also include visuals wherever possible. See our contributing documentation guide for more details. Finally, don't forget to add your tutorial and a short description to the list of tutorials here.
Welcome, and thank you for your interest in contributing to Monty!
We appreciate all of your contributions. Below, you will find a list of ways to get involved and help create AI based on principles of the neocortex.
There are many ways in which you can contribute to the code. For some suggestions, see the Contributing Code Guide.
Monty integrates code changes using GitHub Pull Requests. For details on how Monty uses Pull Requests, please consult the Contributing Pull Requests guide.
This project has many aspects and we try to document them all here in sufficient detail. However, you are the ultimate judge of whether this documentation is sufficient and what kind of information is missing. We try to update the documentation based on questions asked on our communication channels but we are always happy to get more help with this.
Also, if you are contributing to our code, it helps everyone else to include a corresponding update to the documentation in your PR.
Please see our guide on contributing documentation.
As a Monty user or contributor, you can help others become familiar with different aspects of Monty. We are always looking for new approaches to ease the introduction to Monty and its concepts. See the Contributing Tutorials guide on how you could add easy-to-follow tutorials for other users. You can also become active on our forum and help answer questions from others.
You can find the researchers, developers, and users of Monty on the Thousand Brains Forum. Please join us there for active discussion on all things Monty and Thousand Brains.
The vision of this project is to implement a generally intelligent system that can solve a wide variety of sensorimotor tasks (at a minimum, any task the neocortex can solve/humans can perform with ease). To evaluate this, we are continually looking for test beds in which we can assess the system's capabilities. If you have an idea of how to test an important capability or you have an existing benchmark on which you want to compare our algorithm, please consider contributing these.
If you encounter any unexpected behavior, please consider creating a new Bug Report issue if it has not yet been reported.
We value your feedback and insights. If you have suggestions, ideas, or questions, feel free to comment on existing issues. Engaging in discussions helps shape Monty's direction and ensures that the community's needs and concerns are considered. For further details, please refer to the Commenting on Issues Guide.
Reviewing Pull Requests is a great way to contribute to the Monty project and get familiar with the code. By participating in the review process, you help ensure the quality, security, and functionality of the Monty codebase. For details on conducting a review, please refer to the Pull Request Review Guide.
We look forward to your ideas. For smaller changes, consider creating a new Feature Request issue and submitting a Pull Request (see below). For substantial changes, we have a Request For Comments (RFC) process designed to let Core Maintainers understand and comment on your idea before starting implementation. If you are unsure, go ahead and create a new Feature Request issue, and if it can benefit from an RFC, Maintainers will comment and let you know.
We love seeing how you use Monty. If you created something interesting with it, whether a project, research paper, application, or blog post, share it with us.
- Showcase Your Projects: Submit your projects to be featured on our Showcase Page. This is a great way to highlight your work and inspire others.
- Write a Blog Post: Share your experience and insights by writing a blog post. Please share your post with the community on our Discourse server.
- Publish a Paper: If you use our Monty implementation or ideas from the Thousand Brains Theory in your next publication, we would like to feature you on our TBP-based papers list and increase the visibility of your research.
- Present at Community Events: We host regular webinars and community meetups. If you are interested in presenting your project or research, please get in touch with us at [email protected].
- Social Media: Share your creations on social media using the hashtag #1000brainsproject. Follow us on X, Bluesky, or LinkedIn, and subscribe to our YouTube channel or our email list.
Thank you for being a member of our community. By using Monty, you are already promoting Monty and the Thousand Brains Project. If you like our project, we are happy to see you mention us in your social media posts or privately to friends and colleagues.
If you want to discuss further opportunities, such as mentioning us in a blog post or newspaper article or recording an interview with us, don't hesitate to contact [email protected].
If you are a research lab, government institution, or company and you think a closer collaboration with our team could be mutually beneficial, please reach out to us at [email protected].
📘 This page is about contributing Documentation
For the current documentation, see Getting Started.
Our documentation is held in Markdown files in the Monty repo under the /docs folder. This documentation is synchronized to readme.com for viewing whenever a change is made. The order of sections, documents, and sub-documents is maintained by a hierarchy file called /docs/hierarchy.md. This is a fairly straightforward Markdown document that tells readme.com how to order the categories, documents, and sub-documents.
📘 Edits to the documentation need to be submitted in the form of PRs to the Monty repository.
We use Vale to lint our documentation. The linting process checks for spelling errors and ensures that headings follow the APA title case style.
The linting rules are defined in the /.vale/ directory.
You can add new words to the dictionary by adding them to .vale/styles/config/vocabularies/TBP/accept.txt - for more information see Vale's documentation.
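For instance, entries in accept.txt are plain terms, one per line. The entries below are illustrative, not necessarily the repository's actual list:

```
Monty
neocortex
sensorimotor
```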
- Install Vale: Download Vale from its installation page.
- Run Vale: Use the following command in your terminal to run Vale:

  ```shell
  vale .
  ```

  Example output:

  ```
  ➜ tbp.monty git:(main) vale .
  ✔ 0 errors, 0 warnings and 0 suggestions in 141 files.
  ```
Links to other documents should use the standard Markdown link syntax and should be relative to the document's location.
A relative link in the same directory:

```markdown
[Link Text](placeholder-example-doc.md)
```

A relative link, with a deep link to a heading:

```markdown
[Link Text](../contributing/placeholder-example-doc.md#relative-links)
```

These links will work even if you're on a designated version of the documentation.
This is the simplest flow. To modify a document, simply edit the Markdown file in your forked version of the Monty repository and commit the changes by following the normal Pull Requests process.
To create a new document, create the new file in the category directory, then add a corresponding line in the /docs/hierarchy.md file.
```markdown
# my-category: My Category
- [my-new-doc](/my-category/new-placeholder-example-doc.md)
- [some-existing-doc](/my-category/placeholder-example-doc.md)
```

Then, create your Markdown document /docs/my-category/new-placeholder-example-doc.md and add the appropriate Frontmatter.
```markdown
---
title: 'New Placeholder Example Doc'
---

# My first heading
```

> 🚧 Quotes
>
> Please put the title in single quotes and, if applicable, escape any single quotes using two single quotes in a row. Example: `title: 'My New Doc''s'`

> 🚧 Your title must match the URL-safe slug
>
> If your title is `My New Doc's`, then your file name should be `my-new-docs.md`.
Continue with the Pull Requests process.
Documents that are nested under other documents require that you create a folder with the same name as the parent document but without the .md extension. Then, you place any sub-documents in that folder. For example, if you were creating a document called new-placeholder-example-doc.md beneath the document Category One/some-existing-doc.md, you would create a folder called category-one/some-existing-doc and place the new document in that folder.
And then update the hierarchy.md file:

```markdown
# category-one: Category One
- [some-existing-doc](category-one/some-existing-doc.md)
- [new-doc](category-one/some-existing-doc/new-placeholder-example-doc.md)
# category-two: Category Two
...
```

Continue with the Pull Requests process.
If the move is within a category or sub-pages within a page, you can simply edit the hierarchy.md file and update the locations by moving the lines around.
If you are changing the parent path of a document (i.e., sub-page -> page, page -> sub-page, page/sub-page -> new category, or sub-page -> new page), then along with updating the hierarchy.md file, you also must update the folder structure to make sure the document is correctly located. The sync tool will fail with a pre-check error saying there is a mismatch between the hierarchy file and the location on disk if they do not match up.
Continue with the Pull Requests process.
> 🚧 You cannot reorder categories, as the readme.com API does not support this.
>
> Changes to the category order should be done in the readme.com UI and reflected in the hierarchy.md file.
If a document is well established (it has been around for more than 6 months), people may be using permalinks to it. Therefore, it is a good idea to create a redirect file rather than deleting or renaming it. To do this, set the document to hidden with a relocation link to a relevant area or new document location. Hidden files are reachable from the URL, just not shown in the navigation.
```markdown
---
title: 'Badly Named Doc'
hidden: true
---

> ⚠️ this document has moved to <insert link>
```

Continue with the usual Pull Requests process.
To create a new category, simply create a new folder inside the /docs folder and add a reference to it in the hierarchy.md file. Categories in the hierarchy file need a slug and title separated by a colon.
```markdown
# category-one: Category One
# category-two: Category Two
```

In our documentation sync tool, there is a flag to check internal links, image references, and hierarchy file references. This is a good way to ensure that all links are working correctly before submitting a PR.
To check the links, activate the conda environment, and then run the following command:
```shell
python -m tools.github_readme_sync.cli check docs
```
Note
See the readme sync tool documentation for more details on how to use it and how to install the additional dependencies for it.
See the Style Guide images section for details about creating and referencing images correctly.
👍 You have access to VS Code snippets
When you checkout the repository, you have access to markdown snippets for tables, code blocks, warnings and more. While your cursor is in a markdown file, press CMD + Shift + P and select Insert snippet to select a desired documentation snippet.
The documentation Style Guide
The Monty documentation uses the first two parts of semantic versioning (semver), major and minor, as there is nothing to document for patch changes. You can read about semver at https://semver.org/.
Monty uses GitHub Pull Requests to integrate code changes.
Before we can accept your contribution, you must sign the Contributor License Agreement (CLA). You can view and sign the CLA now or wait until you submit your Pull Request.
See the Contributor License Agreement page for more on the CLA.
Before submitting a Pull Request, you should set up your development environment to work with Monty. See the development Getting Started guide for more information.
- Identify an issue to work on.
- Ensure your fork has the latest upstream `main` branch changes (if you don't have a fork of the Monty repository or aren't sure, see the development Getting Started guide):

  ```shell
  git checkout main
  git pull --rebase upstream main
  ```

- Create a new branch on your fork to work on the issue:

  ```shell
  git checkout -b <my_branch_name>
  ```

- Implement your changes. Keep in mind any tests or benchmarks that you may need to add or update.
- If you've added/deleted/modified code, test your changes locally via:

  ```shell
  pytest
  ```

- Push your changes to your branch on your fork:

  ```shell
  git push
  ```
- Create a new GitHub Pull Request from your fork to the official Monty repository.
- Respond to and address any comments on your Pull Request. See Pull Request Flow for what to expect.
- Once your Pull Request is approved, it will be merged by one of the Maintainers. Thank you for contributing! 🥳🎉🎊
- It is recommended to add unit tests for any new feature you implement. This makes sure that your feature continues to function as intended when other people (or you) make future changes to the code. To get a detailed coverage report, use `pytest --cov --cov-report html`.
- Run `pytest`, `ruff check`, and `ruff format` to make sure your changes don't break any existing code and adhere to our style requirements. If your code doesn't pass these, it cannot be merged.
- Make sure that your code is properly documented. Please refer to our Style Guide for instructions on how to format your comments.
- If applicable, please also update or add to the documentation on readme.com. For instructions on how to do this, see our guide on contributing documentation.
- Use callbacks for logging, and don’t put control logic into logging functions.
- Note that the random seed in Monty is handled using a generator object that is passed where needed, i.e., by initializing the random number generator with

  ```python
  rng = np.random.RandomState(experiment_args["seed"])
  ```

  This `rng` is then passed to the various classes and can be accessed in the sensor modules, learning modules, and motor system with `self.rng`. Thus, to use a random numpy method, call it with, e.g., `self.rng.uniform()` rather than `np.random.uniform()`.
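As a rough sketch of this pattern (the class below is hypothetical and only illustrates injecting the generator; it is not an actual Monty module):

```python
import numpy as np

class ToySensorModule:
    """Hypothetical component that receives the shared generator."""

    def __init__(self, rng: np.random.RandomState) -> None:
        self.rng = rng  # injected generator, never a global np.random call

rng = np.random.RandomState(42)  # one seeded generator for the experiment
sensor = ToySensorModule(rng)
sample = sensor.rng.uniform()  # reproducible across runs with the same seed
```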
Note
This page is about contributing RFCs
For existing RFCs, see the rfcs folder.
For RFCs in progress, see issues with rfc:proposal label.
The Request for Comments (RFC) process is intended to provide a consistent and controlled path for substantial changes to Monty so that all stakeholders can be confident about the project's direction. It should also help avoid situations where a contributor spends a lot of time implementing a feature or change that eventually does not get merged.
Many changes, including bug fixes, smaller changes, and documentation improvements, can be implemented and reviewed via Pull Requests.
Substantial changes should undergo a design process and create consensus among the Monty community and Maintainers.
If you are unsure whether the change you intend to work on is substantial, create a new Feature Request issue. Maintainers will comment and let you know if it can benefit from an RFC.
The process here is intended to be as lightweight as reasonable for the present circumstances and not impose more structure than necessary. If you feel otherwise, please consider creating an RFC to update this process.
You need to follow this process if you intend to make substantial changes to Monty or its associated open-source framework and workflows. What constitutes a substantial change is evolving based on community norms and varies depending on what part of the ecosystem you are proposing to change, but may include the following:
- Any fundamental iteration of the Monty architecture
- Removing Monty features
- Any changes to the Cortical Message Protocol (CMP)
- Breaking changes to the API
- Diverging from Monty's brain-inspired philosophy and principles
Some changes do not require an RFC:
- Rephrasing, reorganizing, refactoring, or other changes where "changing shape does not change meaning."
- Additions that strictly improve objective, numerical quality criteria (warning removal, speedup, better platform coverage, more parallelism, etc.)
If you submit a pull request for a substantial change without going through the RFC process, it may be closed with a request to submit an RFC first.
A hastily proposed RFC can hurt its chances of acceptance. Low-quality proposals, proposals for previously rejected features, or those that don't fit into the near-term roadmap may be quickly rejected, which can demotivate the unprepared contributor. Laying some groundwork ahead of the RFC can make the process smoother. Please have a look at some of our past RFCs in the rfcs folder to get a sense of their scope.
Although there is no single way to prepare for submitting an RFC, it is generally a good idea to pursue feedback from other project developers beforehand to ascertain that the RFC may be desirable; having a consistent impact on the project requires concerted effort toward consensus-building.
The most common preparation for writing and submitting an RFC is discussing the idea on the Monty Researcher/Developer Forum or submitting an issue or feature request.
To contribute a substantial change to Monty, the RFC must first be merged into the Monty repository as a markdown file. At that point, the RFC is active and may be implemented with the goal of eventual inclusion into Monty.
- Fork the Monty repository (see the development Getting Started guide)
- Copy one of the templates. The templates are intended to get you started from something other than a blank page. You can use the minimal template `rfcs/0000_minimal_template.md`, or for more detailed proposals, you can use the comprehensive template `rfcs/0000_comprehensive_template.md`. Whichever template you use, copy it to `rfcs/0000_my_proposal.md` (where "my_proposal" is a short but descriptive title). Don't assign an RFC number yet. The file will be renamed accordingly if the RFC is accepted.
- Author the RFC. Put careful thought into your proposal. Please feel free to include images, diagrams, or examples if they help explain your proposal. For an example of a small accepted RFC, see rfcs/0005_easier_rfcs.md. For an example of a comprehensive accepted RFC, see rfcs/0004_action_object.md.
- Submit an RFC Pull Request. As a pull request, the RFC will receive design feedback from the broader community, and you should be prepared to revise it in response. Title your pull request starting with `RFC` and the title of your proposal. For example, `RFC My Proposal`.
- Each RFC Pull Request will be triaged, given an `rfc:proposal` label, and assigned to a Maintainer who will serve as your primary point of contact for the RFC. If you are a Maintainer and you wrote the RFC, you are the point of contact and you should assign the RFC Pull Request to yourself.
- Build consensus and integrate feedback. RFCs with broad support are much more likely to make progress than those that don't receive any comments. Contact the RFC Pull Request assignee for help identifying stakeholders and obstacles.
- In due course, one of the Maintainers will propose a "motion for final comment period (FCP)" along with the disposition for the RFC (merge or close). If you are a Maintainer and the author of the RFC, you can propose the FCP yourself.
- This can happen quickly, or it can take a while.
- This step is taken when enough of the tradeoffs have been discussed so that the Maintainers can decide. This does not require consensus amongst all participants in the RFC thread (which is usually impossible). However, the argument supporting the disposition of the RFC needs to have already been clearly articulated, and there should not be a strong consensus against that position. Maintainers use their best judgment in taking this step, and the FCP itself ensures there is ample time and notification for stakeholders to push back if it is made prematurely.
- For RFCs with lengthy discussions, the motion for FCP is usually preceded by a summary comment that attempts to outline the current state of the discussion and major tradeoffs/points of disagreement.
- The FCP lasts until all Maintainers approve or abstain from the disposition. This way, all stakeholders can lodge any final objections before reaching a decision.
- Once the FCP elapses, the RFC is either merged or closed. If substantial new arguments or ideas are raised, the FCP is canceled, and the RFC goes back into development mode. The assigned Maintainer is the one responsible for merging or closing.
> [!NOTE] Maintainers
>
> - Prior to merging the Pull Request, make one last commit:
>   - Assign the next available sequential number to the RFC.
>   - Rename the Pull Request to include the RFC number, e.g., `RFC 4 Action Object`.
>   - Rename `rfcs/0000_my_proposal.md` accordingly.
>   - Update the asset folder, and any links to assets in that folder, if present, like `rfcs/0000_my_proposal/`.
>   - Provide the link to the RFC Pull Request in the `RFC PR` field at the top of the RFC text.
> - Merge the Pull Request. The commit message should consist of the `rfc:` prefix, RFC number, title, and pull request number, e.g., `rfc: RFC 3 No Three Day Wait (#366)`.
> - Create an Issue in the project that tracks implementation of the merged RFC to fulfill the "Every accepted RFC has an associated issue, tracking its implementation in the Monty repository" requirement. The Issue title should be the RFC number, title, and the word "Implementation", e.g., `RFC 4 Action Object Implementation`.
Once an RFC becomes active, contributors may implement it and submit the implementation as a pull request to the Monty repository. Being active does not guarantee the feature will ultimately be merged. It does mean that, in principle, all the major stakeholders have agreed to the feature and are amenable to merging it.
Furthermore, the fact that a given RFC has been accepted and is active implies nothing about what priority is assigned to its implementation, nor does it imply anything about whether a Maintainer has been assigned the task of implementing it. While it is not necessary that the author of the RFC also write the implementation, it is by far the most effective way to see an RFC through to completion: Authors should not expect that others will take on responsibility for implementing their accepted RFC.
Modifications to active RFCs can be done in follow-up Pull Requests. We strive to write each RFC in a manner that will reflect the final design of the feature. Still, the nature of the process means that we cannot expect every merged RFC to reflect what the end result will be at the time of the next major release; therefore, we try to keep each RFC document somewhat in sync with how the feature is actually being implemented, tracking such changes via follow-up pull requests to the document.
In general, **once accepted, RFCs should not be substantially changed**. Only very minor changes should be submitted as amendments. More substantial changes should be new RFCs, with a note added to the original RFC. Exactly what counts as a very minor change is up to the Maintainers to decide.
While the RFC pull request is open, Maintainers may schedule meetings with the author and relevant stakeholders to discuss the issues in greater detail. A summary from each meeting will be posted back to the RFC pull request.
Maintainers make final decisions about RFCs after the benefits and drawbacks are well understood. These decisions can be made at any time, and Maintainers will regularly make them. When a decision is made, the RFC pull request will be merged or closed. In either case, if the reasoning from the thread discussion is unclear, Maintainers will add a comment describing the rationale for the decision.
Some accepted RFCs represent vital features that need to be implemented right away. Other accepted RFCs can represent features that can wait until someone feels like doing the work. Every accepted RFC has an associated issue, tracking its implementation in the Monty repository.
The author of an RFC is not obligated to implement it. Anyone, including the author, is welcome to submit an implementation for review after the RFC has been accepted.
If you are interested in working on the implementation of an active RFC but cannot determine if someone else is already working on it, feel free to ask (e.g., by leaving a comment on the associated issue).
See the Code Style Guide.
See the Typing Guide.
We use GitHub Actions to run our continuous integration workflows.
The workflow file name is the workflow name in snake_case, e.g., `potato_stuff.yml`.
The workflow name is a human-readable, descriptive Capitalized Case name, e.g.,

```yaml
name: Docs
name: Monty
name: Tools
name: Potato Stuff
```

The job name, when in the position of a key in a `jobs:` dictionary, is a human-readable snake_case name ending with `_<workflow_name>`. When used as a value for the `name:` property, the job name is a human-readable kebab-case name ending with `-<workflow-name>`, e.g.,

```yaml
jobs:
  check_docs:
    name: check-docs

jobs:
  install_monty:
    name: install-monty

jobs:
  test_tools:
    name: test-tools

jobs:
  check_style_potato_stuff:
    name: check-style-potato-stuff
```

In general, we try to stick to native markdown syntax. If you find yourself needing to use HTML, please chat with the team about your use case. It might be something that we build into the sync tool.
In a document, your first level of headings should be `#`, then `##`, and so on. This is slightly confusing, as `#` is usually reserved for the title, but on readme.com the h1 tag is used for the actual title of the document.
Use headings to split up long text blocks into manageable chunks.
Headings can be referenced in other documents using a hash link: `[Headings](doc:style-guide#headings)`. For example: Style Guide - Headings.
All headings should use capitalization following the APA convention. For detailed guidelines, see the APA heading style guide; compliance can be checked with the Vale tool by running `vale .` in the root of the repo.
Footnotes should be referenced in the document with a [1] notation that is linked to a section at the bottom titled # Footnotes.

For example:

```markdown
This needs a footnote[1](#footnote1)

# Footnotes

<a name="footnote1">1</a>: Footnote text
```
Images should be placed in /docs/figures in the repo.
Image file names use snake_case.ext.
Images should generally be png or svg formats. Use jpg if the file is actually a photograph.
Upload high-quality images, as people can click on an image to see the larger version. You can add style attributes after the image path with `#width=300px` or similar.
For example, markdown like the following creates an embedded image with a caption:
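A minimal sketch of such an image reference, assuming the alt text serves as the caption on readme.com (the file name and caption here are hypothetical):

```markdown
![A hypothetical caption](../figures/example_figure.png#width=300px)
```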
> ⚠️ Caption text is only visible on readme.com
You can inline CSV data tables in your markdown documents. The following example shows how to create a table from a CSV file:
!table[../../benchmarks/example-table-for-docs.csv]
The CSV contains the following data:
```csv
Year, Avg Global Temp. (°C), Pirates | align right | hover Pirate Count
1800, 14.3, 50 000
1850, 14.4, 15 000
1900, 14.6, 5 000
1950, 14.8, 2 000
2000, 15.0, 500
2020, 15.3, 200
```
Which produces the following table:
!table[style-guide.csv]
Note that the CSV header row has a bar-separated syntax that allows you to specify the alignment of the columns (left or right) and the hover text.
Readme supports four color-coded callouts:
> 👍 Something good
👍 Something good
> 📘 Information
📘 Information
> ⚠️ Warning
⚠️ Warning
> ❗️ Alert
❗️ Alert
Billions of people use commas as a thousands separator, and billions use the period as the thousands separator. As this documentation is expected to be widely used, we will use space as the separator, as this is the internationally recommended convention.
For example, 1 million is written numerically as 1 000 000.
Note
For an architecture overview see the Architecture Overview page. Each of the major components in the architecture can be customized. For more information on how to customize different modules, please see our guide on Customizing Monty.
There are many ways in which you can contribute to the code. The list below is not comprehensive but might give you some ideas.
- Create a Custom Sensor Module: Sensor Modules are the interface between the real-world sensors and Monty. If you have a specific sensor that you would like Monty to support, consider contributing a Sensor Module for it. Also, if you have a good idea of how to extract useful features from a raw stream of data, this can be integrated as a new Sensor Module.
- Create a Custom Learning Module: Learning Modules are the heart of Monty. They are the repeating modeling units that can learn from a stream of sensorimotor data and use their internal models to recognize objects and suggest actions. What exactly happens inside a Learning Module is not prescribed by Monty. We have some suggestions, but you may have a lot of other ideas. As long as a Learning Module adheres to the Cortical Message Protocol and implements the abstract functions defined here, it can be used in Monty. It would be great to see many ideas for Learning Modules in this code base that we can test and compare. For information on how to implement a custom Learning Module, see our guide on Customizing Monty.
- Write a Custom Motor Policy: Monty is a sensorimotor system, which means that action selection and execution are important aspects. Model-based action policies are implemented within the Learning Module's Goal State Generator, but model-free ones, as well as the execution of the suggested actions from the Learning Modules, are implemented in the motor system. Our Thousand Brains Project team doesn't have much in-house robotics experience, so we value contributions from people who do.
- Add Support for More Environments: If you know of other environments that would be interesting to test Monty in (whether you designed it or it is a common benchmark environment), you can add a custom `EnvironmentInterface` to support this environment.
- Improve the Code Infrastructure: Making the code easier to read and understand is a high priority for us, and we are grateful for your help. If you have ideas on how to refactor or document the code to improve this, consider contributing. We also appreciate help on making our unit test suite more comprehensive. Please create an RFC before working on any major code refactor.
- Optimize the Code: We are always looking for ways to run our algorithms faster and more efficiently, and we appreciate your ideas on that. Just like the previous point, PRs around this should not change anything in the outputs of the system.
- Add to our Benchmarks: If you have ideas on how to test more capabilities of the system, we appreciate it if you add to our benchmark experiments. This could be evaluating different aspects in our current environments or adding completely new environments. Please note that in order to allow us to frequently run all the benchmark experiments, we only add one experiment for each specific capability we test and try to keep the run times reasonable.
- Work on an open Issue: If you came to our project and want to contribute code but are unsure of what, the open Issues are a good place to start. See our guide on how to identify an issue to work on for more information.
Monty integrates code changes using GitHub Pull Requests. To start contributing code to Monty, please consult the Contributing Pull Requests guide.
We are excited about all contributors, and there is a wide range of motivations you may have for contributing. Here is a non-exhaustive list of what those reasons might be and the benefits you may get from contributing.
- You are tired of incremental progress on ANN benchmarks.
- You don't believe that LLMs are the path to true machine intelligence/understanding the brain.
- You want to do exciting research but don't have a big compute budget.
- You are looking for a wide open space to explore new ideas.
- You want to solve tasks where little training data is available.
- You want to solve sensorimotor tasks.
- You want to solve a task that requires quick, continuous learning and adaptation.
- You want to better understand the brain and principles underlying our intelligence.
- You want to work on the future of AI.
- You want to be part of a truly unique and special project.
Here is a list of concrete outputs you may get out of working on this project.
- Write a publication.
- Write your bachelor's or master's thesis on the thousand brains approach.
- Be part of an awesome community.
- Have your project showcased on our showcase page.
- Have your paper listed on our TBP-based papers page.
- Become a code contributor.
- Lastly, for those out there who love achievements, note that when an RFC you have made is merged and active, you can get a player icon of your choice on our project roadmap. Maybe we'll see you there soon? 🎯
As we are putting this code under an MIT license and Numenta has put its related patents under a non-assert pledge, people can also build commercial applications on this framework. However, the current code is very much research code and not an out-of-the-box solution so it will require significant engineering effort to tailor it to your application.
Note
For Maintainers
The philosophy behind triage is to check issues for validity and accept them into the various Maintainer workflows. Triage is not intended to engage in lengthy discussions on Issues or review Pull Requests. These are separate activities.
The typical triage outcomes are:
- Label and accept the Issue or Pull Request with `triaged`.
- Label, request more information, and mark the Issue with `needs discussion` and `triaged`.
- Reject by closing the Issue or Pull Request with `invalid`.
Note
Triage link (is issue, is open, is not triaged)
https://github.com/thousandbrainsproject/tbp.monty/issues?q=is:issue+is:open+-label:triaged
The desired cadence for Issue Triage is at least once per business day.
A Maintainer will check the Issue for validity.
Do not reproduce or fix bugs during triage.
- A short descriptive title.
- Ideally, the Issue creator followed the instructions in the Issue templates.
If not, and more information is needed, do not close the Issue. Instead, proceed with triage, request more information by commenting on the Issue, and add a needs discussion label to indicate that additional information is required. Remember to add the triaged label to indicate that the Issue was triaged after you applied any additional labels.
A valid Issue is on-topic, well-formatted, contains expected information, and does not violate the code of conduct.
Multiple labels can be assigned to an Issue.
- `bug`: Apply this label to bug reports.
- `documentation`: Apply this label if the Issue relates to documentation without affecting code.
- `enhancement`: Apply this label if the Issue relates to new functionality or changes in functional code.
- `infrastructure`: Apply this label if the Issue relates to infrastructure like GitHub, continuous integration, continuous deployment, publishing, etc.
- `invalid`: Apply this label if you are rejecting the Issue for validity.
- `needs discussion`: Apply this label if the Issue is missing information to determine what to do with it.
- `triaged`: At a minimum, apply this label if the Issue is valid and you have triaged it.
Do not assign priority or severity to Issues (see: RFC 2 PR and Issue Review).
Do not assign Maintainers to Issues. Issues remain unassigned so that anyone can work on them (see: RFC 2 PR and Issue Review).
If you feel that someone should be notified of the Issue, make a comment and mention them in the comment.
The desired cadence for Pull Request Triage is at least once per business day.
First, review any Pull Requests pending CLA.
Note
Pending CLA link (is pull request, is open, is not a draft, is not triaged, is pending CLA)
If the Pull Request CLA check is passing (you may need to rerun the CLA check), remove the `cla` label.
Note
Triage link (is pull request, is open, is not a draft, is not triaged, is not pending CLA)
First, check if the Pull Request CLA check is passing. If the check is not passing, add the `cla` label and move on to the next Pull Request. The skipped Pull Request will be triaged again after the CLA check is passing.
A Maintainer will check the Pull Request for validity.
There is no priority or severity applied to Pull Requests.
A valid Pull Request is on-topic, well-formatted, contains expected information, does not depend on an unmerged Pull Request, and does not violate the code of conduct.
A Draft Pull Request is ignored and not triaged.
- A short descriptive title.
- If the Pull Request claims to resolve an Issue, that Issue is linked and valid.
- If the Pull Request is standalone, it clearly and concisely describes what is being proposed and changed.
- If the Pull Request is related to a previous RFC process, the RFC document is referenced.
- The Pull Request branches from a recent main commit.
- The Pull Request does not depend on another unmerged Pull Request.
Note
Pull Requests that depend on unmerged Pull Requests add unnecessary complexity to the review process: Maintainers must track the status of multiple Pull Requests and re-review them if the dependent Pull Request is updated. Such a dependency is much easier for the Pull Request author to track, and it is better to submit the Pull Request after all dependent code is already merged to main.
It is OK if the commit history is messy. It will be "squashed" when merged.
Multiple labels can be assigned to a Pull Request. For example, an enhancement can come with documentation and continue along the Pull Request Flow after being triaged.
- `cla`: Apply this label if the Pull Request CLA check is failing.
- `documentation`: Apply this label if the Pull Request relates to documentation without affecting code.
- `enhancement`: Apply this label if the Pull Request implements new functionality or changes functional code.
- `infrastructure`: Apply this label if the Pull Request concerns infrastructure such as GitHub, continuous integration, continuous deployment, publishing, etc.
- `invalid`: Apply this label if you are rejecting the Pull Request for validity.
- `rfc:proposal`: Apply this label if the Pull Request is a Request For Comments (RFC).
- `triaged`: At a minimum, apply this label if the Pull Request is valid, you triaged it, and it should continue the Pull Request Flow.
Do not assign priority or severity to Pull Requests.
Use your judgment to assign the Pull Request to one or more Maintainers for review (this is the GitHub Assignees feature). If you are the author, you may not assign the Pull Request to yourself. Note that for RFCs, the meaning of the Assignees list is different (see Request For Comments (RFC)).
Use your judgment to request a review for the Pull Request from one or more people (this is the GitHub Reviewers feature).
If you feel that someone should be notified of the Pull Request, make a comment and mention them in the comment.
No. Multiple people can work on the same issue, and different people can submit multiple pull requests for the same issue. Ultimately, which pull request will be merged will be decided by Maintainers during the Review. However, to avoid double efforts, it is encouraged to comment on an issue when you start working on it to inform others about this. Also, if you decide to stop working on an issue, it is helpful to leave a comment so that someone else can pick up the issue and know what roadblocks you may have hit.
If you have nothing specific to work on or want to become more familiar with the project, you can start by finding existing issues in the Issue Tracker. Every triaged issue contains labels that provide additional information.
- Issues with a `good first issue` label should be appropriate for first-time committers and don't usually require broad changes across the code base or a deep understanding of the algorithm's intricacies.
If you decide to start work on one of the issues, please leave a comment to let Maintainers and others know.
Before creating a new issue, please check if a similar issue exists in the Issue Tracker. Perhaps someone already reported it, and someone might even be working on it.
If you do not see a similar issue reported, please create a New issue using one of the provided templates.
The future work documents have special Frontmatter metadata that is used to power the future-work widget.
Here is an example of what the Frontmatter fields look like:
```yaml
---
title: Future Work Widget
rfc: https://github.com/thousandbrainsproject/tbp.monty/blob/main/rfcs/0015_future_work.md
estimated-scope: medium
improved-metric: community-engagement
output-type: documentation
skills: github-actions, python, github-readme-sync-tool, S3, JS, HTML, CSS
contributor: codeallthethingz
status: in-progress
---
```

The following fields are validated against allow lists defined in snippet files to ensure consistency and quality.
Tags is a comma-separated list of keywords, useful for filtering the future work items. Edit future-work-tags.md.
!snippet[../../snippets/future-work-tags.md]
Skills is a comma-separated list of skills that will be needed to complete this work. Edit future-work-skills.md.
!snippet[../../snippets/future-work-skills.md]
Very roughly, how big of a chunk of work is this? Edit future-work-estimated-scope.md.
!snippet[../../snippets/future-work-estimated-scope.md]
Is the work completed, or is it in progress? Edit future-work-status.md.
!snippet[../../snippets/future-work-status.md]
What type of improvement does this work provide? Edit future-work-improved-metric.md.
!snippet[../../snippets/future-work-improved-metric.md]
What type of output will this work produce? Edit future-work-output-type.md.
!snippet[../../snippets/future-work-output-type.md]
Does this work item require an RFC? These values are processed in the validator.py code and can be of the form:

```
https://github\.com/thousandbrainsproject/tbp\.monty/.*
required
optional
not-required
```
The contributor field should contain GitHub usernames, as these are converted to their avatars inside the table. These values are processed in the validator.py code and must be of the form:

```
[a-zA-Z0-9][a-zA-Z0-9-]{0,38}
```
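As a sketch of how these patterns might be checked (the actual logic in validator.py may differ):

```python
import re

# Patterns copied from the validation rules described above.
RFC_PATTERN = re.compile(
    r"https://github\.com/thousandbrainsproject/tbp\.monty/.*"
    r"|required|optional|not-required"
)
CONTRIBUTOR_PATTERN = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9-]{0,38}")

# Both frontmatter values below would pass validation.
assert RFC_PATTERN.fullmatch("optional")
assert CONTRIBUTOR_PATTERN.fullmatch("codeallthethingz")
```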
We follow the PEP8 Python style guide.
Additional style guidelines are enforced by Ruff and configured in pyproject.toml.
To quickly check if your code is formatted correctly, run ruff check in the tbp.monty directory.
We use Ruff to check proper code formatting with a line length of 88.
A convenient way to ensure your code is formatted correctly is using the ruff formatter. If you use VSCode, you can get the Ruff VSCode extension and set it to format on save (modified lines only) so your code always looks nice and matches our style requirements.
We adopted the Google Style for docstrings. For more details, see the Google Python Style Guide - 3.8 Comments and Docstrings.
After discovering that PyTorch-to-NumPy conversions (and the reverse) were a significant speed bottleneck in our algorithms, we decided to consistently use NumPy to represent the data in our system.
We still require the PyTorch library since we use it for certain things, such as multiprocessing. However, please use NumPy operations for any vector and matrix operations whenever possible. If you think you cannot work with NumPy and need to use Torch, consider opening an RFC first to increase the chances of your PR being merged.
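For example, a typical vector computation can stay in NumPy end to end, avoiding the PyTorch-to-NumPy round trips mentioned above (an illustrative snippet, not Monty code):

```python
import numpy as np

rng = np.random.RandomState(0)
points = rng.rand(100, 3)  # 100 points in 3D

# Pairwise Euclidean distances using NumPy broadcasting only;
# no conversion to or from torch tensors is needed.
diffs = points[:, None, :] - points[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))
```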
Another reason we discourage using PyTorch is to add a barrier for deep learning to creep into Monty. Although we don't have a fundamental issue with contributors using deep learning, we worry that it will be the first thing someone's mind goes to when solving a problem (when you have a hammer...). We want contributors to think intentionally about whether deep learning is the best solution for what they want to solve. Monty relies on very different principles than those most ML practitioners are used to, so it is useful to think outside of the mental framework of deep learning. More importantly, evidence that the brain can perform the long-range weight transport required by deep learning's cornerstone algorithm, back-propagation, is extremely scarce. We are developing a system that, like the mammalian brain, should be able to use local learning signals to rapidly update representations, while also remaining robust under conditions of continual learning. As a general rule, therefore, please avoid PyTorch and the algorithm it is usually leveraged to support: back-propagation!
You can read more about our views on deep learning in Monty in our FAQ.
All source code files must have a copyright and license header. The header must be placed at the top of the file, on the first line, before any other code. For example, in Python:
```python
# Copyright <YEARS> Thousand Brains Project
#
# Copyright may exist in Contributors' modifications
# and/or contributions to the work.
#
# Use of this source code is governed by the MIT
# license that can be found in the LICENSE file or at
# https://opensource.org/licenses/MIT.
```

The <YEARS> is the year of the file's creation, and an optional sequence or range of years if the file has been modified over time. For example, if a file was created in 2024 and not modified again, the first line of the header should be `# Copyright 2024 Thousand Brains Project`. If the file has been modified in consecutive years between 2022 and 2024, the header should be `# Copyright 2022-2024 Thousand Brains Project`. If the file has been modified in multiple non-consecutive years, in 2022, then in 2024 and 2025, the header should be `# Copyright 2022,2024-2025 Thousand Brains Project`.
In other words, if you are creating a new file, add the copyright and license header with the current year. If you are modifying an existing file and the header does not include the current year, then add the current year to the header. You should never need to modify anything aside from the year in the very first line of the header.
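For illustration only, here is a hypothetical helper that produces the year portion in the format described above (this is not part of the Monty tooling):

```python
def copyright_years(years: list[int]) -> str:
    """Collapse years into the header format, e.g. [2022, 2024, 2025] -> "2022,2024-2025"."""
    ordered = sorted(set(years))
    parts = []
    start = prev = ordered[0]
    for year in ordered[1:]:
        if year == prev + 1:  # extend the current consecutive range
            prev = year
            continue
        parts.append(f"{start}-{prev}" if prev > start else str(start))
        start = prev = year
    parts.append(f"{start}-{prev}" if prev > start else str(start))
    return ",".join(parts)

print(copyright_years([2024]))              # 2024
print(copyright_years([2022, 2023, 2024]))  # 2022-2024
print(copyright_years([2022, 2024, 2025]))  # 2022,2024-2025
```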
Note
While we deeply value and appreciate every contribution, the source code file header is reserved for essential copyright and license information and will not be used for contributor acknowledgments.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14, RFC 2119, RFC 8174 when, and only when, they appear in all capitals, as shown here.
This guidance does not dictate the only way to implement functionality. There are many ways to implement any particular functionality, each of which will work. This guidance establishes constraints so that as more functionality is implemented and functionality is changed, it remains as easy as it was with the first piece of functionality.
Please note that there are differences between research and platform requirements. While research needs speed and agility, the platform needs modularity and stability. These are different and can conflict. The guidance here is for the platform. If you are a researcher, you MAY ignore this guidance in your prototype and code in the way most effective for you and your task. Later, if your prototype works and needs to be integrated into Monty, then we will refactor the prototype to correspond to the guidance here.
Why? We want to move to Protocols eventually. Keeping the abstract classes free of implementation makes it easier to transition to Protocols in the future. Whereas, having an abstract class with some implementation requires additional refactoring when we transition to Protocols.
Why do we want to move to Protocols eventually? We want to catch errors as early as possible, and using Protocols allows us to do this at type check time. Using abstract classes delays this until class instantiation, once the program runs.
```python
from abc import ABC, abstractmethod
from typing import Protocol

# ABCs raise errors during instantiation when/if the constructor is called:
class Monty(ABC):
    def implemented(self):
        pass

    @abstractmethod
    def unimplemented(self):
        pass

class DefaultMonty(Monty):
    pass

def invoke(monty: Monty):
    monty.unimplemented()

monty = DefaultMonty()  # runtime error & fails type check
invoke(monty)  # OK, no type error

# ---
# Typical inheritance raises errors during runtime when monty.unimplemented() is called:
class Monty:
    def implemented(self):
        pass

    def unimplemented(self):
        raise NotImplementedError

class DefaultMonty(Monty):
    pass

def invoke(monty: Monty):
    monty.unimplemented()  # runtime error

monty = DefaultMonty()  # OK, no type error
invoke(monty)  # OK, no type error

# ---
# Protocols raise errors during type check when attempting use:
class MontyProtocol(Protocol):
    def implemented(self): ...
    def unimplemented(self): ...

class DefaultMonty:
    def implemented(self):
        pass

def invoke(monty: MontyProtocol):
    monty.unimplemented()  # runtime error

monty = DefaultMonty()  # OK
other: MontyProtocol = DefaultMonty()  # fails type check
invoke(monty)  # fails type check
```

While abstract classes MAY be used, you SHOULD prefer Protocols.
Protocols document a behaves-like-a relationship.
Why: We want to catch errors as early as possible, and using Protocols allows us to do this at type check time. Using abstract classes delays this until class instantiation, once the program runs.
There is no material difference in the context of usage and expectation documentation between using Protocols and abstract classes. In other contexts, Protocols are favorable because they allow us to raise errors at type check time and, due to structural typing, do not require inheritance.
```python
from typing import Protocol

# Protocols raise errors during type check when attempting use:
class MontyProtocol(Protocol):
    def implemented(self): ...
    def unimplemented(self): ...

class DefaultMonty:
    def implemented(self):
        pass

def invoke(monty: MontyProtocol):
    monty.unimplemented()  # runtime error

monty = DefaultMonty()  # OK
other: MontyProtocol = DefaultMonty()  # fails type check
invoke(monty)  # fails type check
```

Why: Inheritance hierarchy allows for overriding methods. As class hierarchies deepen, override analysis becomes more complex. The issue is not how the code functions but the difficulty of reasoning about behavior when multiple layers of overrides are possible. The deeper the hierarchy, the more difficult it is to track what code a specific instance uses, and it makes it unclear where functionality should be overridden. Modifying code with a deep inheritance hierarchy is also complex, in that any change can have cascading effects up and down the hierarchy.
Most of the time, you should default to not using inheritance hierarchy, and instead, reach for other ways to assemble functionality. Inheritance is appropriate for an is-a relationship, but this is quite a rare occurrence in practice. A lot of things seem like they form an is-a relationship, but the odds of that relationship being maintained drop off dramatically as the code evolves and the hierarchy deepens.
```python
class Rectangle:
    def __init__(self, length: float, height: float) -> None:
        super().__init__()  # See "You SHOULD always include call to super().__init__() ..." section below
        self._length = length
        self._height = height

    @property
    def area(self) -> float:
        return self._length * self._height

class Square(Rectangle):
    def __init__(self, side: float) -> None:
        super().__init__(side, side)

# So far so good...
# The next day, we want to add resize functionality
class Rectangle:
    # --unchanged code omitted--
    def resize(self, new_length: float, new_height: float) -> None:
        self._length = new_length
        self._height = new_height

# But, now this no longer makes sense for the Square
sq = Square(5)
sq.resize(5, 3)  # ?!
```

As depicted in the example above, if we assume an is-a relationship as the default and reach for inheritance, we can very rapidly introduce functionality that violates the is-a relationship requirement.
Using composition by default instead:

```python
# We want to reuse the area calculating functionality, hence this class
class DefaultAreaComputer:
    @staticmethod
    def area(length: float, height: float) -> float:
        return length * height

class Rectangle:
    def __init__(self, length: float, height: float) -> None:
        super().__init__()
        self._length = length
        self._height = height
        self._area_computer = DefaultAreaComputer

    @property
    def area(self) -> float:
        return self._area_computer.area(self._length, self._height)

class Square:
    def __init__(self, side: float) -> None:
        super().__init__()
        self._side = side
        self._area_computer = DefaultAreaComputer

    @property
    def area(self) -> float:
        return self._area_computer.area(self._side, self._side)

# Now, we want to implement resize for Rectangle
class Rectangle:
    # --unchanged code omitted--
    def resize(self, new_length: float, new_height: float) -> None:
        self._length = new_length
        self._height = new_height

# No issues, because we never assumed is-a relationship in the first place.
```

What if we now want to replace the DefaultAreaComputer with a different implementation?
```python
# Implement a different computer
class TooComplicatedAreaComputer:
    @staticmethod
    def area(length: float, height: float) -> float:
        return 4 * (length / 2) * (height / 2)

# Use new computer in Rectangle
class Rectangle:
    def __init__(self, length: float, height: float) -> None:
        super().__init__()
        self._length = length
        self._height = height
        self._area_computer = TooComplicatedAreaComputer
    # --unchanged code omitted--
```

What if we want to make the area computer configurable?
```python
from typing import Protocol

# Define the protocol
class AreaComputer(Protocol):
    @staticmethod
    def area(length: float, height: float) -> float: ...

# Update Rectangle to accept an area computer
class Rectangle:
    def __init__(
        self,
        length: float,
        height: float,
        area_computer: type[AreaComputer] = TooComplicatedAreaComputer,
    ) -> None:
        super().__init__()
        self._length = length
        self._height = height
        self._area_computer = area_computer
    # --unchanged code omitted--
```

If we want our code to change rapidly, to try out different ideas, and to configure existing code with these variants, using modular components for functionality reuse instead of inheritance allows for changes to remain small in scope without affecting unrelated functionality up and down the inheritance chain.
Bare Functions or Static Methods SHOULD Be Used to Share a Functionality Implementation That Does Not Access Instance State
Why: Do not require state that you don’t access. Functions without state are vastly easier to reuse, refactor, reason about, and test.
```python
# calculating an area
class DefaultAreaComputer:
    @staticmethod
    def area(length: float, height: float) -> float:
        return length * height

# alternatively
def area(length: float, height: float) -> float:
    return length * height
```

A reason to use a static method on a class over a bare function would be when we want to pass the functionality to another class. This is because we want our configurations to be serializable, and a type is serializable in a more straightforward manner than a Callable would be. For example:
```python
from typing import Callable

# Static method approach
class Rectangle:
    def __init__(
        self,
        length: float,
        height: float,
        area_computer: type[AreaComputer] = DefaultAreaComputer,  # Easier to serialize
    ) -> None:
        super().__init__()
        self._length = length
        self._height = height
        self._area_computer = area_computer

    @property
    def area(self) -> float:
        return self._area_computer.area(self._length, self._height)

# Bare function approach
class Rectangle:
    def __init__(
        self,
        length: float,
        height: float,
        area_computer: Callable[[float, float], float] = area,  # More challenging to serialize
    ) -> None:
        super().__init__()
        self._length = length
        self._height = height
        self._area_computer = area_computer

    @property
    def area(self) -> float:
        return self._area_computer(self._length, self._height)
```

To Share a Functionality Implementation That Reads the State of the Instance Being Mixed With, Mixins MAY Be Used
For sharing functionality, mixins only implement shared behaves-like-a functionality. They add behavior; however, mixins SHALL NOT add state to the instance being mixed with. Every time you find yourself in need of state when working on a mixin, switch to composition instead.
Why: When mixins do not add state, they are not terrible for implementing shared functionality. That's why you MAY use them for this. However, when mixins add state, you must look in two places for the state to understand the implementation instead of one. Having to look in two places is an example of incidental complexity, i.e., complexity that is not inherent to the problem being solved. Incidental complexity should be minimized.
# OK, Mixin only reads state
class RectangleAreaMixin:
    @property
    def area(self) -> float:
        return self._length * self._height

class Rectangle(RectangleAreaMixin):
    def __init__(self, length: float, height: float) -> None:
        super().__init__()
        self._length = length
        self._height = height

# ---

# Not OK, Mixin adds state
class RectangleAreaMixin:
    def __init__(self, length: float, height: float) -> None:
        super().__init__()
        self._length = length
        self._height = height

    @property
    def area(self) -> float:
        return self._length * self._height

class Rectangle(RectangleAreaMixin):
    def __init__(self, length: float, height: float) -> None:
        super().__init__(length, height)

Composition is used to implement a has-a relationship.
Why: Components encapsulate additional state in a single concept in a single place in the code.
Given a Rectangle that uses a DefaultAreaComputer (because we reuse that functionality elsewhere), let’s say we want to count how many times we resized it.
import math

class DefaultAreaComputer:
    @staticmethod
    def area(length: float, height: float) -> float:
        return length * height

class Rectangle:
    def __init__(self, length: float, height: float) -> None:
        super().__init__()
        self._length = length
        self._height = height
        self._area_computer = DefaultAreaComputer
        self._resize_count = 0  # We track the count in Rectangle state

    @property
    def area(self) -> float:
        return self._area_computer.area(self._length, self._height)

    def resize(self, new_length: float, new_height: float) -> None:
        self._length = new_length
        self._height = new_height
        self._resize_count += 1  # We update internal state

    @property
    def resize_count(self) -> int:
        return self._resize_count

# Now, we want to reuse the count functionality.
# First, we extract/encapsulate the DefaultCounter functionality
class DefaultCounter:
    def __init__(self) -> None:
        self._count = 0

    def increment(self) -> None:
        self._count += 1

    @property
    def count(self) -> int:
        return self._count

# We then update Rectangle to use the shared functionality
class Rectangle:
    def __init__(self, length: float, height: float) -> None:
        super().__init__()
        self._length = length
        self._height = height
        self._area_computer = DefaultAreaComputer
        # Note that the count itself (state) is no longer in the Rectangle
        self._resize_counter = DefaultCounter()  # We track the count in DefaultCounter

    # --unchanged code omitted--

    def resize(self, new_length: float, new_height: float) -> None:
        self._length = new_length
        self._height = new_height
        self._resize_counter.increment()  # We update the count

    @property
    def resize_count(self) -> int:
        return self._resize_counter.count

# And now that we extracted the DefaultCounter functionality, we can use it elsewhere
class Circle:
    def __init__(self, radius: float) -> None:
        super().__init__()
        self._radius = radius
        # Note that DefaultCounter() introduces new state, but it is
        # encapsulated within the component
        self._resize_counter = DefaultCounter()

    @property
    def area(self) -> float:
        # We don't need to make everything a component.
        # Since we don't reuse circle area functionality anywhere,
        # it is OK to have it inline here.
        return math.pi * self._radius ** 2

    def resize(self, new_radius: float) -> None:
        self._radius = new_radius
        self._resize_counter.increment()

    @property
    def resize_count(self) -> int:
        return self._resize_counter.count

Why: This avoids possible issues with multiple inheritance by opting into "cooperative multiple inheritance."
This should not be an issue once all of our code follows this guidance document, specifically ensuring that Mixins do not introduce state, making Mixins with __init__ unlikely. However, it may be a while before we get there, so this guidance is included.
See https://eugeneyan.com/writing/uncommon-python/#using-super-in-base-classes for additional details, but here are some examples with their corresponding output:
Note
The print call occurs after the call to super().__init__(). The output would be in a different order if print occurred before the super().__init__() call.
# Correct and expected
class Parent:
    def __init__(self) -> None:
        super().__init__()
        print("Parent init")

class Mixin:
    pass

class Child(Mixin, Parent):
    def __init__(self) -> None:
        super().__init__()
        print("Child init")

child = Child()
# Output
# > Parent init
# > Child init

# Also correct and expected
class Parent:
    def __init__(self) -> None:
        super().__init__()
        print("Parent init")

class Mixin:
    pass

class Child(Parent, Mixin):
    def __init__(self) -> None:
        super().__init__()
        print("Child init")

child = Child()
# Output
# > Parent init
# > Child init

The problems begin when the inherited classes all have __init__ defined.
# Correct and expected
class Parent:
    def __init__(self) -> None:
        super().__init__()
        print("Parent init")

class Mixin:
    def __init__(self) -> None:
        super().__init__()
        print("Mixin init")

class Child(Mixin, Parent):
    def __init__(self) -> None:
        super().__init__()
        print("Child init")

child = Child()
# Output
# > Parent init
# > Mixin init
# > Child init

# Also correct and expected
class Parent:
    def __init__(self) -> None:
        super().__init__()
        print("Parent init")

class Mixin:
    def __init__(self) -> None:
        super().__init__()
        print("Mixin init")

class Child(Parent, Mixin):
    def __init__(self) -> None:
        super().__init__()
        print("Child init")

child = Child()
# Output
# > Mixin init
# > Parent init
# > Child init

# If you skip the super().__init__() call in one of the inherited classes, some __init__ methods are skipped

# class Child(Mixin, Parent) where we skip super().__init__() in Mixin
class Parent:
    def __init__(self) -> None:
        super().__init__()
        print("Parent init")

class Mixin:
    def __init__(self) -> None:
        # super().__init__() skipped
        print("Mixin init")

class Child(Mixin, Parent):
    def __init__(self) -> None:
        super().__init__()
        print("Child init")

child = Child()
# Output
# > Mixin init
# > Child init

# class Child(Parent, Mixin) where we skip super().__init__() in Parent
class Parent:
    def __init__(self) -> None:
        # super().__init__() skipped
        print("Parent init")

class Mixin:
    def __init__(self) -> None:
        super().__init__()
        print("Mixin init")

class Child(Parent, Mixin):
    def __init__(self) -> None:
        super().__init__()
        print("Child init")

child = Child()
# Output
# > Parent init
# > Child init

Python introduced type hinting in version 3.5, as defined in PEP 484. The Python interpreter itself generally ignores type hints, so they are mainly consumed through IDE/LSP support and type-checking tools like mypy. The system was designed to be optional: any code that ran before would continue to run after the introduction of type hints. It is also gradual, in that the entire codebase doesn't have to be fully type hinted before some degree of benefit can be realized.
In the future, we would like to utilize type hints in the Monty codebase. However, there are multiple approaches that can be taken to do this, so we want to provide some guidance to ensure we're adding type hints that are giving us the most benefits.
Python already has a dynamic type system. Variables don't have types that restrict the values that can be assigned to them. The values themselves have types that determine what can be done with them; however, any check of those types occurs at runtime, throwing errors when the checks fail.
Using type hints with a type checker implements a static type system. A static type system assigns types to variables, restricting which values can be assigned to them and which operations can be used on those variables and checking those types ahead of time without running the code.
The main benefit of a static type checker enforcing constraints is that it can prevent errors by checking ahead of time for method calls or operations on a variable that would otherwise throw an error at runtime. Type hints also document what arguments are allowed in methods and what types they return. They can also encode logic into the type system to allow for proving certain properties of the code.
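For example, a type checker flags the following call before the code ever runs (a minimal sketch; half is a hypothetical function):

def half(x: int) -> float:
    return x / 2

half("six")  # FAILS to typecheck: "str" is not "int"
# Without type hints, this call would only fail at runtime with a TypeError.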
One thing to be aware of: the Python interpreter DOES NOT care about type hints. Anything you could normally do in Python without type hints will still be possible at runtime with type hints. The only value they add is when a type checker is used to confirm that the operations being performed match what the type hints indicate is allowed.
This especially applies to newtypes (see below for details). Newtypes do nothing at runtime. There’s a slight performance hit for the call to the “constructor” of the newtype, but it’s negligible. It is, like the rest of type hinting, just a hint, that the Python interpreter ignores, but a type checker can use to help ensure correctness.
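For instance, the following sketch (Meters is a hypothetical newtype) shows that a newtype leaves runtime values untouched:

from typing import NewType

Meters = NewType("Meters", float)

distance = Meters(3.0)
print(type(distance))  # <class 'float'> -- the "constructor" returns its argument unchanged
print(distance + 1.5)  # 4.5 -- at runtime this is just a float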
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14, RFC 2119, and RFC 8174 when, and only when, they appear in all capitals, as shown here.
When specifying the arguments to a method or function, the assigned types should be as abstract as possible. For example, if a method needs some collection of items, instead of specifying the type as a List, use Iterable instead.
from typing import Iterable, List

# Don't restrict the argument to a List
def double_list(l: List[int]) -> List[int]:
    return [x * 2 for x in l]

# Instead, use an appropriate abstract collection type
def double_coll(c: Iterable[int]) -> List[int]:
    return [x * 2 for x in c]

Using the broadest type that provides the functionality needed allows for more flexibility when calling the function or method. The second example can be called with a set, while the first only works for lists.
The type returned from a method or function should be the most concrete type possible. Similar to the previous example, the return type could be Collection[int], but that type wouldn't allow for calling list-specific methods on the returned value, even though we're returning a list.
Returning the most concrete type gives more flexibility to the caller to use that value in ways that the function/method author might not have considered.
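As a sketch reusing the doubling example above, compare returning an abstract Collection with returning the concrete List:

from typing import Collection, Iterable, List

def double_abstract(c: Iterable[int]) -> Collection[int]:
    return [x * 2 for x in c]

def double_concrete(c: Iterable[int]) -> List[int]:
    return [x * 2 for x in c]

double_abstract({1, 2}).append(6)  # FAILS to typecheck: Collection has no append
double_concrete({1, 2}).append(6)  # OK: the caller knows it has a list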
Structural typing is a type system in which the structure of the type is what matters for type checking. Two values with the same structure will type-check as the same type, regardless of any type aliases being used (see the Python glossary for a stricter definition of structural types).
Basic types like str, list, and dict[int, str] are examples of structural types in Python. Nothing about a dict[int, str] indicates what those keys or values represent, and any dictionary that takes integers as keys and has strings as values would type-check against that type.
Type aliases are structural types; they do not lead to nominal types. From the type checker's perspective, they are replaced with the type that they alias. They are only a convenience for not having to write out long structural types.
While Protocols provide structural subtyping, they are an exception to this guidance and SHOULD be used for static duck typing (see the section on duck typing for details).
Structural types SHOULD be avoided where possible because they don't define the concepts that the types represent, reducing their usefulness.
from math import sqrt
from typing import Tuple

# Type alias for a quaternion.
# This doesn't define a new type, it just allows the function
# definition below to be shorter.
QuaternionWXYZ = Tuple[float, float, float, float]
# Define another one for a different order
QuaternionXYZW = Tuple[float, float, float, float]

def normalize_quaternion(quat: QuaternionWXYZ) -> QuaternionWXYZ:
    # Based on the type alias, the function expects w to come first
    w, x, y, z = quat
    norm = sqrt(w * w + x * x + y * y + z * z)
    return (w / norm, x / norm, y / norm, z / norm)

quat: QuaternionXYZW = (0.0, 0.0, 0.0, 1.0)
norm = normalize_quaternion(quat)  # This type checks, but gives invalid results

Alternative ways to model a quaternion are using newtypes or dataclasses, both forms of nominal types (see the section on nominal typing for details on the benefits of nominal types). Which to choose would depend on whether additional functionality is needed, or on easier compatibility with third-party libraries.
from dataclasses import dataclass
from typing import NewType, Tuple

@dataclass
class Quaternion:
    w: float
    x: float
    y: float
    z: float

# Or
QuaternionWXYZ = NewType("QuaternionWXYZ", Tuple[float, float, float, float])

The dataclass approach forces the author to specify which coefficient they want to access, removing ambiguity. The newtype still requires the author to be careful about which float in the tuple is the one they want, but the name helps indicate which it is. The newtype also prevents passing a raw tuple where a QuaternionWXYZ is expected without explicitly turning it into one.
An exception to this guideline would be the types of internal fields of a class, like the floats in the Quaternion dataclass in the previous example. Oftentimes those don’t need to be anything more specific than the underlying structural type. An example might be a name field on a class, which could simply be a str. Unless there is some extra metadata, e.g. the str is untrusted user input that should be handled carefully, a simple str will suffice. This can also apply to things like lists, sets, and dictionaries.
A rule of thumb in most cases would be to ask whether the type needs to be exposed outside of the class or function in which the variable is defined. Function or method arguments and return types should generally avoid structural types in favor of nominal types.
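As a sketch of this rule of thumb (all names here are hypothetical):

from dataclasses import dataclass
from typing import NewType

# Crosses function boundaries, so give it a nominal type
ObjectName = NewType("ObjectName", str)

@dataclass
class ObjectModel:
    name: str        # internal field; a plain str suffices
    num_points: int  # likewise, no newtype needed

def find_model(name: ObjectName) -> ObjectModel:
    # Argument and return types are nominal; internals stay structural
    return ObjectModel(name=str(name), num_points=0)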
Nominal typing is a type system in which the name of the type is what matters for type checking. Two values with the same structure but different names are distinct types in this system (see the Python glossary for a stricter definition of nominal type).
More complex types, like dataclasses and regular classes, are examples of nominal types in Python. Two dataclasses with the same fields but different names would type-check as different types. Classes and dataclasses (which are a convenience for defining certain kinds of classes) are opaque: their names matter more to the type checker than the fields or methods they have. Different classes are different types, and the only way instances of those classes can type-check as each other is if one is a subclass of the other (this is the subtype polymorphism provided by object-oriented languages).
Python also provides NewTypes which can be used to define nominal types for basic types that would normally be structurally typed. An example of a newtype would be to define the concept of radians and degrees for angles in a system, and ensure that they can't be used in the wrong places.
import math
from typing import NewType

Radians = NewType("Radians", float)
Degrees = NewType("Degrees", float)

# Both Radians and Degrees are floats, but they cannot be swapped.
def rads_to_degrees(rads: Radians) -> Degrees:
    return Degrees(math.degrees(rads))

right_angle = Radians(math.pi / 2)
another_angle = Degrees(180.0)

rads_to_degrees(right_angle)  # works fine

# FAILS to typecheck: "Degrees" is not "Radians", etc.
rads_to_degrees(another_angle)

# FAILS to typecheck: "float" is not "Radians", etc.
rads_to_degrees(270.0)

Another example would be to codify the order of a quaternion (since there is disagreement between libraries whether to use WXYZ or XYZW) and then check usages (see the quaternion example in the section on structural typing).
Newtypes should be used when no additional functionality beyond the underlying type is needed, but metadata about the type needs to be tracked. An example of this would be unsafe and safe strings in a web application. Confusing the two can lead to security vulnerabilities, but with newtypes we can help the author think about which is being used.
from typing import NewType

UnsafeString = NewType("UnsafeString", str)
SafeString = NewType("SafeString", str)

def render_with_content(template: SafeString, content: SafeString) -> str:
    # This isn't the best way to do this, but it's an example
    return template.format(content)

def sanitize_string(s: UnsafeString) -> SafeString:
    # Do some sanitization (a placeholder; real sanitization is more involved)
    new_s = s.replace("<", "&lt;").replace(">", "&gt;")
    return SafeString(new_s)

unsafe_input = UnsafeString("<script>alert('hi')</script>")  # comes from some user input
template = SafeString("Hello, {}")

# FAILS to typecheck: "UnsafeString" is not "SafeString"
render_with_content(template, unsafe_input)

safe_input = sanitize_string(unsafe_input)

# FAILS to typecheck: "SafeString" is not "UnsafeString"
safe_input = sanitize_string(safe_input)

In this example, it becomes more difficult to accidentally pass an unsafe string to a rendering function that could cause security issues, and it's also difficult to accidentally sanitize a safe string a second time.
Duck typing is a form of structural typing in dynamically typed languages where two different value types can be substituted for each other if they have the same methods. The name comes from the duck test. “If it walks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.”
Python’s type hinting library provides Protocols to allow defining the shared interface that types can implement in order to be considered the same type by a type checker. They allow defining interfaces that other types use without requiring those types to inherit from a base class to do it. A Protocol can be implemented by types elsewhere without those types even knowing about the Protocol’s existence.
Protocols allow for defining abstract interfaces that types can satisfy to allow functions and methods to accept the broadest possible types. The abstract collection types, e.g. Collection, Iterable, etc., in the standard library can be thought of like protocols that various concrete collection types implement. This can be seen in the first example above.
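A minimal sketch of a Protocol used for static duck typing (Quacker, Duck, and Robot are hypothetical):

from typing import Protocol

class Quacker(Protocol):
    def quack(self) -> str: ...

class Duck:  # knows nothing about Quacker
    def quack(self) -> str:
        return "Quack!"

class Robot:  # also knows nothing about Quacker
    def quack(self) -> str:
        return "beep-quack"

def make_noise(q: Quacker) -> None:
    print(q.quack())

make_noise(Duck())   # type checks: Duck has the right structure
make_noise(Robot())  # type checks: so does Robot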
The Any type in Python causes the type checker to stop attempting to do type checking, since it has no information whatsoever to go on. This means we lose all the benefits of using a static type checker, and thus it SHOULD NOT be used.
Another similar type is object. While it can be used with arbitrary typed values similar to Any, it is an ordinary static type so the type checker will reject most operations on it (i.e., those not defined for all objects), and so it MAY be used where appropriate (see the Mypy documentation on Any vs. object for more details).
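A short sketch of the difference, assuming a type checker such as mypy:

from typing import Any

def with_any(value: Any) -> None:
    value.quack()  # accepted by the type checker, may crash at runtime

def with_object(value: object) -> None:
    print(value)   # OK: defined for all objects
    value.quack()  # FAILS to typecheck: "object" has no attribute "quack"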
Some third-party libraries, especially ones that are extensions written in another language like C, provide poor type hinting, meaning the majority of the benefits that come from static type checking aren't available. To minimize this surface area, we should isolate code that uses these third-party libraries, wrapping them in functions or methods that provide the correct types that the rest of our code expects.
Even libraries like NumPy that purport to have type hints available sometimes provide ones that are less useful. For example, many functions return npt.NDArray[Any] when we know for certain that the type should be npt.NDArray[np.float64]. In cases like these, we want to isolate the chaos and return the types we want.
import numpy as np
import numpy.typing as npt

def do_maths(input: ...) -> npt.NDArray[np.float64]:
    # do lots of NumPy calls
    return result

This might require the use of typing.cast() or explicit type hints on variable declarations to satisfy the type checker.
The granularity SHOULD NOT be individual NumPy functions, but rather whole operations in our code where we use multiple NumPy functions in a row. We're not trying to make a wrapper for poorly typed libraries, but rather to ensure that when we return values into our code, store values on objects, etc., we give them useful types.
# Don't do something this granular
def mujoco_worldbody(spec: MjSpec) -> MjsBody:
    return spec.worldbody  # this might require `cast(MjsBody, spec.worldbody)`

def some_other_method(...) -> None:
    ...
    worldbody = mujoco_worldbody(spec)
    ...

# Instead, just declare the types in the other method where needed
def some_other_method(...) -> MjsBody:
    ...
    worldbody: MjsBody = spec.worldbody
    # More MuJoCo calls on worldbody that may need type hints
    ...

This would also be a good place to use newtypes to define input argument types like quaternion tuples that have a particular order so that callers don't pass the wrong values into these external library functions. See the nominal type guidance above.
Before we can accept your contribution, you must sign the Contributor License Agreement (CLA). You can view and sign the CLA here.
The Pull Request flow begins when you Create a Pull Request to our GitHub repository.
A Maintainer will check the Pull Request for validity. If it is invalid, it will be Rejected. The Maintainer should include an explanation for why the Pull Request has been rejected.
A valid Pull Request is on-topic, well-formatted, and contains expected information.
A Maintainer will apply the appropriate labels (triaged at a minimum) as part of Triage.
A Maintainer may also assign the Pull Request to one or more Maintainers for Review by adding them to the "Assignees" list.
Note
Maintainers
When assigning the Pull Request for Review, please keep in mind that all assigned Maintainers must Approve the Pull Request before it can be Merged.
Use the "Assignees" list to assign the Pull Request to a Maintainer. The "Reviewers" list may include others that contributed to reviewing the Pull Request.
If your Pull Request has been Rejected and you want to resubmit it, you should start over and Create a new Pull Request with appropriate changes. Depending on the reason for rejection, it might be good to submit an RFC first.
Automated checks run against the Pull Request as part of the Review. If the automated checks fail, you should Update the Pull Request.
If all automated checks pass, Maintainers will Review the Pull Request. If changes are needed, Maintainers will Request Changes.
If automated checks fail or Maintainers Request Changes, you are responsible for Updating the Pull Request. Once you have updated it, the Pull Request will again enter Review.
Once all assigned Maintainers (listed in the "Assignees" list) Approve your Pull Request, your Pull Request is eligible to be Merged.
If the automated checks fail while your Pull Request is Approved, it is your responsibility to Update the Pull Request.
Maintainers may delay Merging your Pull Request to accommodate any optional commits implementing non-blocking suggestions from the Review (these are depicted as Suggested Commits in the flow diagram).
Once your Pull Request is Approved, if you make any unexpected commits that do not implement non-blocking suggestions from the Review (these are depicted as Unexpected Commits in the flow diagram), Maintainers may Withdraw Approval and the Pull Request will need to be Reviewed again.
Caution
Maintainers
The reason for you to Withdraw Approval would be to communicate the need for additional Review after unexpected changes. If you do not Withdraw Approval, other Maintainers may interpret the Pull Request as Approved and merge it.
Caution
Maintainers
All assigned Maintainers must Approve the Pull Request before it can be Merged.
As of this writing (October 2024), GitHub does not provide a mechanism to enforce this.
Maintainers will Merge your Approved Pull Request.
Note
Maintainers
The commit message for the merge commit should comply with RFC 10 Conventional Commits.
We use the following commit types:
- fix: Fix to a bug in the src/tbp/monty codebase. This correlates with PATCH in RFC 7 Monty versioning.
- feat: Introduction of a new feature to the src/tbp/monty codebase. This correlates with MINOR in RFC 7 Monty versioning.
- build: Change to the build system or external dependencies.
- ci: Change to our GitHub Actions configuration files and scripts.
- docs: Documentation-only update.
- perf: Performance improvement.
- refactor: A src/tbp/monty code change that neither fixes a bug nor adds a feature.
- style: Change that does not affect the meaning of the code (white-space, formatting, etc.).
- test: Adding or correcting tests.
- chore: A catch-all for work outside of the types identified above. For example, the commit affects infrastructure, tooling, development, or other non-Monty framework code.
- rfc: RFC proposal.
- revert: Commit that reverts a previous commit.
Breaking changes are communicated by appending ! after the type. This correlates with MAJOR in RFC 7 Monty versioning.
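For example, a hypothetical breaking change to a public interface might use a commit title such as feat!: change the learning module configuration format.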
Note
Maintainers
Verify that Co-authored-by headers added by GitHub to the commit message are correct. Sometimes, when you merge the main branch into a Pull Request, GitHub will automatically add you as a co-author of that Pull Request. As this is not what we consider authorship, please ensure you remove any Co-authored-by headers of this nature.
Leave any legitimate Co-authored-by headers in place, e.g., from commits cherry-picked into the Pull Request.
After Merge, automated post-merge checks and tasks will run. If these fail, the Pull Request will be Reverted. If they succeed, you are Done 🥳🎉🎊.
If your Pull Request is Reverted, you should start over and Create a new Pull Request with the appropriate changes.
Thank you for reviewing a Pull Request. Below are some guidelines to consider.
Before a Pull Request is reviewed, it should pass the automated checks. If the automated checks fail, you may want to wait until the author makes the required updates.
For a Pull Request to be Merged, it must be Approved by at least one Maintainer, and the Pre-merge checks must pass. See Pull Request Flow for additional details.
If you are not a Maintainer, you may still review the Pull Request and provide insights. While you are unable to Approve the Pull Request, you can still influence what sort of code is included in Monty.
Multiple aspects must be considered when conducting a Pull Request Review. Generally, Maintainers should favor approving a Pull Request if it improves things along one or several of the dimensions listed below.
Consider the overall design of the change. Does it make sense? Does it belong in Monty? Is now the time for this functionality? Does it fit with the design patterns within Monty?
Does the code do what the author intended? Are the changes good for both end-users and developers? Is there a demo available so that you can evaluate the functionality by experiencing it instead of reading the code?
Does the code perform as the author intended? Is it an improvement over current performance on our benchmarks (or at least no degradation)?
Complex code is more difficult to debug. The code should be as simple as possible, but not simpler.
If applicable, ask for unit, integration, or end-to-end tests. The tests should fail when the code is broken. Complex tests are more difficult to debug. The tests should be as simple as possible, but not simpler.
If applicable, ask for benchmarks. Existing benchmarks should not worsen with the change.
A good name unambiguously tells the reader what the variable, class, or function is for.
Does the code follow the Style Guide? Prefer automated style formatting.
Are the comments necessary? Usually, comments should explain why, not the what or how.
Note that comments are distinct from documentation in the code, such as class or method descriptions.
Is the change sufficiently documented? Can a user understand the new code or feature without any other background knowledge (like things discussed in Pull Request review comments or meetings)? Does every class and function have appropriate docstrings explaining its purpose, inputs, and outputs?
Note: code documentation can also be too verbose. If docstrings are getting too long, consider adding a new page to overall documentation instead. Comments don't need to explain every line of code.
Generally, look at every line of the Pull Request. If asked to review only a portion of the Pull Request, comment on what was reviewed.
If you have difficulty reading the code, others will also have difficulty; ask for clarification. Approach each pull request with fresh eyes and consider whether other users will understand the changes. Just because you understand something (maybe because you talked about it in another conversation or you spent a lot of time thinking about the change) doesn't mean that others will. The code should be intuitive to understand and easy to read.
If you feel unqualified to review some of the code, ask someone else to review that portion of the code.
"Zoom out." Consider the change in the broader context of the file or the system as a whole. Should the changes now include refactoring or re-architecture?
Does the pull request have an appropriate scope? Each pull request should address one specific problem or add one specific feature. If there are multiple additions in a pull request, consider asking the contributor to split them into separate pull requests.
If you see things you like, let the author know. This keeps the review from being fully focused on mistakes and invites future contributions. Celebrate the awesome work of this individual.
Remember the people involved in the Pull Request. Be welcoming. If there is a conflict or coming to an agreement is difficult, changing the medium from Pull Request comments to audio or video can help. Sometimes, complex topics can be more easily discussed with a real-time meeting, so don't hesitate to suggest a time when everyone can meet synchronously.
Be cordial and polite in your review. A Pull Request is a gift and should be treated as such. Assume the best of intentions in submitted Pull Requests, even if that Pull Request is eventually rejected.
Feel free to Request Changes on a Pull Request. Once changes are requested, it is up to the author to Update the Pull Request, at which point the Pull Request will again be Reviewed. See Pull Request Flow for additional details.
We encourage including the prefix nit: for suggestions or changes that are minor and wouldn’t prevent you from approving the Pull Request. This helps distinguish nitpicks from essential, blocking requests.
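For example: nit: this variable name could be more descriptive. The author can address it or not without blocking Approval.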
A Pull Request only requires Approval from one Maintainer.
Once Approved, before the Pull Request is Merged, pre-merge checks must pass. See Pull Request Flow for additional details.
Please see the instructions here if you would like to tackle one of these tasks.
All our docs are open-source. If something is wrong or unclear, submit a PR to fix it!
We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, or sexual identity and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.
Examples of behavior that contributes to a positive environment for our community include:
- Demonstrating empathy and kindness toward other people
- Being respectful of differing opinions, viewpoints, and experiences
- Giving and gracefully accepting constructive feedback
- Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience
- Focusing on what is best not just for us as individuals, but for the overall community
Examples of unacceptable behavior include:
- The use of sexualized language or imagery, and sexual attention or advances of any kind
- Trolling, insulting or derogatory comments, and personal or political attacks
- Public or private harassment
- Publishing others' private information, such as a physical or email address, without their explicit permission
- Other conduct which could reasonably be considered inappropriate in a professional setting
Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.
Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned with this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.
This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official email address, posting via an official social media account, or acting as an appointed representative at an online or offline event.
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement by emailing [email protected]. All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the reporter of any incident.
Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:
Community Impact: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.
Consequence: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.
Community Impact: A violation through a single incident or series of actions.
Consequence: For the person in violation, a warning is sent with consequences for continued behavior and no further contact with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.
Community Impact: A serious violation of community standards, including sustained inappropriate behavior.
Consequence: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.
Community Impact: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.
Consequence: A permanent ban from any sort of public interaction within the community.
This Code of Conduct is adapted from the Contributor Covenant, version 2.1, available at
https://www.contributor-covenant.org/version/2/1/code_of_conduct.html.
Community Impact Guidelines were inspired by Mozilla's code of conduct enforcement ladder.
For answers to common questions about this code of conduct, see the FAQ at https://www.contributor-covenant.org/faq. Translations are available at https://www.contributor-covenant.org/translations.
Thousand Brains Project adopts a hierarchical technical governance structure for Monty.
- A community of Contributors who file issues, make pull requests, and contribute to the project.
- A small set of Maintainers, who drive the overall project direction.
- The Maintainers have a Lead Maintainer, the catch-all decision maker.
All Maintainers are expected to have a strong bias towards the Thousand Brains Project and Monty's design philosophy.
Contributors are encouraged to participate by filing issues, making pull requests, and contributing to the project.
The current Lead Maintainer is listed in the MAINTAINERS file.
- Make the final decision when Maintainers cannot reach a consensus or make a decision.
- Confirm or deny adding new or removing current Maintainers.
The current Maintainers are listed in the MAINTAINERS file.
- Maintain Monty's design philosophy, technical direction, and conventions.
- Engage with the community and Contributors.
- Triage, review, and comment on issues.
- Triage and comment on pull requests.
- Review and merge assigned pull requests.
- Serve as the primary point of contact for assigned Requests For Comments (RFCs).
- Propose disposition of RFCs (e.g., merge or close).
- Approve or abstain from the disposition of RFCs.
- Any Maintainer can nominate a new Maintainer by starting a private email thread amongst all the current Maintainers with the nomination and proposing that the nominee be invited to join.
- The Maintainers are given sufficient time to respond to the nomination. If there is disagreement, a discussion ensues, possibly resulting in a vote.
- The Lead Maintainer confirms or denies the nomination.
- Any Maintainer can propose to remove a current Maintainer by starting a private email thread amongst all the current Maintainers, excluding the person proposed to be removed.
- The Maintainers are given sufficient time to respond to the removal proposal. If there is disagreement, a discussion ensues, possibly resulting in a vote.
- The Lead Maintainer confirms or denies the removal.
- Any Maintainer can resign by contacting the Lead Maintainer privately.
- The Lead Maintainer is selected by the Executive Director of the Thousand Brains Project.
One of the goals of the Thousand Brains Project is to create a new form of AI that benefits the world to the greatest extent possible. To accomplish this goal of broad adoption, we have chosen a permissive license and Numenta has created a non-assertion pledge that covers the patents related to the Thousand Brains Project. These two actions, we believe, will encourage researchers, universities, the private sector, and industry specialists to help accelerate our community's progress in building impressive new capabilities to serve the world.
We have chosen the standard MIT License. This permissive license places few restrictions on reuse and permits commercial use. It is simple, concise, and compatible with many other licenses.
Numenta holds several patents that were created as we unraveled the complexities of the human neocortex and how these principles can be applied to machine learning. However, Numenta's non-assertion pledge means we will only assert these patents defensively to protect the open-source community. You are free to use all of the technology in the open-source project to create value without worrying about Numenta's patents.
- Numenta, Inc. will not enforce the listed patents against any party, provided that the party or its affiliates do not assert or profit from assertion of any patents against Numenta, Inc., its affiliates or its licensees.
- In the event that the party initiates a patent infringement lawsuit against Numenta, Inc., its affiliates or its licensees, Numenta, Inc. reserves the right to pursue all available legal remedies, including but not limited to, asserting counterclaims and defenses, based on the listed patents.
- This pledge grants no warranties of any kind, either express or implied, as to the patents, and all such warranties are expressly disclaimed.
- This pledge shall be governed by and construed under the laws of California.
This page showcases some projects that were realized using the Monty code-base. If you have a project that you would like to see featured here, simply create a PR adding it to this page.
Please make sure your project is well documented, including a README on how to run it and ideally some images or video showcasing it. Feel free to also include a video or image here. Please also keep your description on this page short and concise.
Watch the video: 2023/03 - Monty's First Live Demo in the Real World: https://www.youtube.com/watch?v=KcE004QbuSw
This is the first real-world demo of Monty that the TBP team came up with. We used an iPad camera to take an image of an object. Monty then moves a small patch over this image and tries to recognize the object.
See the monty_lab project folder for more details.
The first example of Monty moving its sensors in the real world.
Follow the LEGO tutorial to try this out yourself.
See the everything_is_awesome repository for more information.
Watch the video:
2025/05 - Robot Hackathon Presentations: https://www.youtube.com/watch?v=_u7STtACQ50
Using sensorimotor AI to guide ultrasound.
Follow the ultrasound tutorial for more details.
See the ultrasound_perception repository for more information.
Watch the video:
2025/05 - Ultrasound Presentation and Demo: https://www.youtube.com/watch?v=-zrq0oTJudo
- Nair, H., Leyman, W., Sampath, A., Jacobson, Q., & Shen, J. P. (2024). NeRTCAM: CAM-Based CMOS Implementation of Reference Frames for Neuromorphic Processors. arXiv. https://arxiv.org/abs/2405.11844
- Hole, K.J. Tool-Augmented Human Creativity. Minds & Machines 34, 16 (2024). https://doi.org/10.1007/s11023-024-09677-x
For a list of our papers see here.
If you would like to add your paper to this list, simply open a PR editing this document. Please only add papers here that are explicitly based on the TBT or the Monty code base, not papers that briefly cite it. Please use APA style for the citation.
Monty is designed for sensorimotor applications. It is not designed to learn from static datasets like many current AI systems are, and so it may not be a drop-in replacement for an existing AI use case you have. Our system is made to learn and interact using similar principles to the human brain, the most intelligent and flexible system known to exist. The brain is a sensorimotor system that constantly receives movement input and produces movement outputs. In fact, the only outputs from the brain are for movement. Our system works the same way. It needs to receive information about how the sensors connected to it are moving in space in order to learn structured models of whatever it is sensing. The inputs are not just a bag of features, but instead features at poses (location + orientation). Have a look at our challenging preconceptions page for more details.
⚠️ This is a Research Project

Note that Monty is still a research project, not a full-fledged platform that you can just plug into your existing infrastructure. Although we are actively working on making Monty easy to use and are excited about people who want to test our approach in their applications, we want to be clear that we are not offering an out-of-the-box solution at the moment. There are many capabilities that are still on our research roadmap, and the tbp.monty code base is still in major version zero, meaning that our public API is not stable and could change at any time.
Any application where you have moving sensors is a potential application for Monty. This could be physical movement of sensors on a robot. It could also be simulated movement such as our simulations in Habitat or the sensor patch cropped out of a larger 2D image in the Monty meets world experiments. It could also be movement through conceptual space or another non-physical space such as navigating the internet.
Applications where we anticipate Monty to particularly shine are:
- Applications where little data to learn from is available
- Applications where little compute to train is available
- Applications where little compute to do inference is available (like on edge devices)
- Applications where no supervised data is available
- Applications where continual learning and fast adaptation is required
- Applications where the agent needs to generalize/extrapolate to new tasks
- Applications where interpretability is important
- Applications where robustness is important (to noise but also samples outside of the training distribution)
- Applications where multimodal integration is required or multimodal transfer (learning with one modality and inferring with another)
- Applications where the agent needs to solve a wide range of tasks
- Applications where humans do well but current AI does not
To get an idea of how Monty could be used in the future, you can have a look at this video where we go over how key milestones on our research roadmap will unlock more and more applications. Of course, there is no way to anticipate the future and there will likely be many applications we are not thinking of today. But this video might give you a general idea of the types of applications a Thousand Brains System will excel in.

2025/04 - TBP Future Applications & Positive Impacts: https://www.youtube.com/watch?v=Iap_sq1_BzE
For a list of current and future capabilities, see Capabilities of the System. For experiments (and their results) measuring the current capabilities of the Monty implementation, see our Benchmark Experiments.
There are three major components that play a role in the architecture: sensors, learning modules, and actuators [1]. These three components are tied together by a common messaging protocol, which we call the cortical messaging protocol (CMP). Due to the unified messaging protocol, the inner workings of each individual component can be quite varied as long as they have the appropriate interfaces [2].
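As a purely illustrative sketch of the "features at poses" idea underlying the CMP (hypothetical names; the actual protocol in tbp.monty differs in detail):

from dataclasses import dataclass, field

import numpy as np

@dataclass
class CMPMessage:
    location: np.ndarray                       # where the feature was sensed (3D)
    orientation: np.ndarray                    # e.g., surface normal and curvature directions
    features: dict[str, float] = field(default_factory=dict)  # e.g., {"hue": 0.3}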
Those three components and the CMP are described in the following sub-sections. For a presentation of all the content in this section (plus a few others), have a look at the recording from our launch symposium:
2024/12 - Overview of the TBP and the Monty Implementation: https://www.youtube.com/watch?v=lqFZKlsb8Dc
1: Sensors may also be actuators, with the capability to take a motor command to move or attend to a new location.
2: In general, the learning modules in an instance of Monty will adhere to the concepts described herein, however it is possible to augment Monty with alternative learning modules. For example, we do not anticipate that the learning modules described herein will be useful for calculating the result of numerical functions, or for predicting the structure of a protein given its genetic sequence. Alternative systems could therefore be leveraged for such tasks and then interfaced according to the CMP.
Benchmark Experiments

Performance of the current implementation on our benchmark test suite.
These benchmark experiments are not common benchmarks from the AI field. They are a set of experiments we have defined for ourselves to track our research progress. They specifically evaluate capabilities that we have added or plan to add to Monty.
You can find Monty experiment configs for all the following experiments in the benchmarks folder. Note that the experiment parameters are not overly optimized for accuracy. The parameters used here aim to strike a good balance between speed and accuracy to allow our researchers to iterate quickly and evaluate algorithm changes regularly. If a particular use case requires higher accuracy or faster learning or inference, this can be achieved by adjusting learning module parameters.
If you want to evaluate Monty on external benchmarks, please have a look at our application criteria and challenging preconceptions pages first. Particularly, note that Monty is a sensorimotor system made to efficiently learn and infer by interacting with an environment. It is not designed for large, static datasets.
We split up the experiments into a short benchmark test suite and a long one. The short suite tests performance on a subset of 10 out of the 77 YCB objects which allows us to assess performance under different conditions more quickly. Unless otherwise indicated, the 10 objects are chosen to be distinct in morphology and models are learned using the surface agent, which follows the object surface much like a finger.
When building the graph we add a new point if it differs in location by more than 1cm from other points already learned, or its features are different from physically nearby learned points (a difference of 0.1 for hue and 1 for log curvature). Experiments using these models have 10distobj in their name.
To be able to test the ability to distinguish similar objects (for example by using more sophisticated policies) we also have a test set of 10 similar objects (shown below) learned in the same way. Experiments using these models have 10simobj in their name.
For experiments with multiple sensors and learning modules we currently only have a setup with the distant agent so we also have to train the models with the distant agent. These models have less even coverage of points since we just see the objects from several fixed viewpoints and can't move as freely around the object as we can with the surface agent. This is why these models have a few missing areas where parts of the object were never visible during training. In the 5LM experiments, each LM has learned slightly different models, depending on their sensor parameters. The image below shows the models learned in one LM. Results with one LM for comparability are given in the experiment marked with dist_on_distm (i.e., distant agent evaluated on distant-agent trained models).
Configs with multi in the name have additional distractor objects, in addition to the primary target object. These experiments are designed to evaluate the model's ability to stay on an object until it is recognized. As a result, they are currently set up so that the agent always begins on a "primary" target object, and recognition of this object is the primary metric that we evaluate. In addition, however, there is a "step-wise" target, which is whatever object an LM's sensor is currently viewing; the ultimate MLH or converged ID of an LM is therefore also compared to the step-wise target that the LM was observing at the time. To make recognition of the primary target relatively challenging, distractor objects are added as close as possible along the horizontal axis, while ensuring that i) the objects do not clip into each other, and ii) an initial view of the primary target is achieved at the start of the episode. Note that these experiments cannot currently be run with multi-processing (the -m flag), as the Object Initializer classes need to be updated. Example multi-object environments are shown below.
Configs with _dist_agent in the name use the distant agent for inference (by default they still use the models learned with the surface agent). This means that the sensor is fixed in one location and can only tilt up, down, left, and right following a random walk. When using the model-based hypothesis-testing policy, the agent can also "jump" to new locations in space. Configs with surf_agent in the name use the surface agent for inference which can freely move around the entire object. Both the surface and the distant agent can execute model-based actions using the hypothesis testing policy. For more details, see our documentation on policies.
Configs with base in their name test each object in the 14 orientations in which they were learned. No noise is added to the sensor.
Configs with randrot in their name test each object in 10 random, new rotations (different rotations for each object).
Configs with noise in their name test with noisy sensor modules, where we add Gaussian noise to the sensed locations (0.002), surface normals (2), curvature directions (2), log curvatures (0.1), pose_fully_defined (0.01), and hue (0.1). The numbers in brackets are the standard deviations used for sampling the noisy observations. Note that the learned models were acquired without sensor noise: the image below visualizes how much location noise we get during inference, while the LM still contains the noiseless models shown above.
Configs with rawnoise in the name test with noisy raw sensor input, where Gaussian noise (standard deviation 0.001) is applied directly to the depth image used for location, surface normal, and curvature estimation. This lets us test the noise robustness of the sensor module, in contrast to the noise experiments, which test the noise robustness of the learning module.
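Both noise conditions can be sketched in a few lines, assuming numpy and the standard deviations quoted above (the observation format is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng()

def noisy_features(obs):
    """Add Gaussian noise to processed features (the noise condition)."""
    noisy = dict(obs)
    noisy["location"] = np.asarray(obs["location"]) + rng.normal(0.0, 0.002, size=3)
    noisy["hue"] = obs["hue"] + rng.normal(0.0, 0.1)
    # Angular features (surface normals, curvature directions) would be
    # perturbed by small rotations with the stds quoted above; omitted here.
    return noisy

def noisy_depth(depth_image, std=0.001):
    """Add Gaussian noise to the raw depth image (the rawnoise condition)."""
    return depth_image + rng.normal(0.0, std, size=depth_image.shape)
```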
Note that all benchmark experiments were performed with the total least-squares regression implementation for computing the surface normals, and the distance-weighted quadratic regression for the principal curvatures (with their default parameters).
The following results are obtained from experiments using the 10-object subsets of the YCB dataset described above. base configs test with all 14 known rotations (10 objects * 14 rotations each = 140 episodes), and randrot configs test with 10 random rotations (10 objects * 10 rotations each = 100 episodes). All experiments were run on 16 CPUs with parallelization, except for base_10multi_distinctobj_dist_agent, which must be run without parallelization.
!table[../../benchmarks/results/ycb_10objs.csv]
The following results are obtained from experiments on the entire YCB dataset (77 objects). Since this means having 77 instead of 10 objects in memory, having to disambiguate between them, and running 77 instead of 10 episodes per epoch, these runs take significantly longer. We therefore only test 3 known rotations ([0, 0, 0], [0, 90, 0], [0, 180, 0]) for the base configs and 3 random rotations for the randrot configs. The 5LM experiment is currently run with just 1 epoch (1 random rotation per object) but might be extended to 3; it is run on 48 CPUs instead of 16.
!table[../../benchmarks/results/ycb_77objs.csv]
- Why does the distant agent do worse than the surface agent?
The distant agent has limited capabilities to move along the object. In particular, it currently uses an almost random policy, which is not as efficient and informative as the surface agent's policy of following the object's principal curvatures. Note, however, that both the distant and surface agent can now move around the object using the hypothesis-testing action policy, so the difference in performance between the two is not as great as it previously was.
- Why does the distant agent perform worse on distant-agent models than on surface-agent models?
As you can see in the figure above, the models learned with the distant agent have several blind spots and unevenly sampled areas. When we test random rotations, we may see the object from views that are underrepresented in the object model. If we use a 10% threshold instead of 20%, we can actually get slightly better performance with the distant agent, since we allow it to converge faster. This may be because it has less time to move into badly represented areas and reaches the time-out condition less often.
- Why is the accuracy on distinct objects higher than on similar objects?
Since we need to be able to deal with noise, objects that are similar to each other can get confused. In particular, objects that differ only in a few specific locations (like the fork and the spoon) can be difficult to distinguish if there is noise and the policy doesn't efficiently move to the distinguishing features.
- Why is raw sensor noise so much worse than the standard noise condition?
This is not related to the capabilities of the learning module but to the sensor module. Currently, our surface normal and principal curvature estimates are not implemented to be very robust to sensor noise, so noise in the depth image can distort the surface normal by more than 70 degrees. We don't want our learning module to be robust to this much noise in the surface normals; instead, we want the sensor module to communicate better features. We have already made some improvements to our surface normal estimates, which helped a lot on the raw noise experiment.
- Why do the distant agent experiments take longer and have more episodes where the most likely hypothesis is used?
Since the distant agent policy is less efficient in how it explores a given view (a random walk of camera tilts), the distant agent takes more steps to converge, or sometimes does not resolve the object at all (in which case we reach a time-out and use the MLH). If we have to take more steps per episode, the runtime also increases.
- Why is the runtime for 77 objects longer than for 10?
For one, we run more episodes per epoch (77 instead of 10), so each epoch takes longer. However, in the current benchmark we test with fewer rotations (only 3 epochs instead of the 14 or 10 epochs in the shorter experiments). The main factor is therefore that the number of evidence updates we need to perform at each step scales linearly with the number of objects an LM has in its memory (going down over time as we remove objects from our hypothesis space). Additionally, we need to take more steps to distinguish 77 objects than to distinguish 10 (especially if the 10 objects are distinct).
In general, we want to be able to learn and infer dynamically instead of having a clear-cut separation between supervised pre-training and inference. We also want to be able to learn unsupervised. This is tested in the following experiment using the surface agent. We test the same 10-object set as above with 10 fixed rotations. In the first epoch, each object should be recognized as new (no_match), leading to the creation of a new graph. The following episodes should correctly recognize the object and add new points to the existing graphs. Since we do not provide labels, it can happen that one object is recognized as another, in which case their graphs are merged. This happens especially with similar objects, but ideally their graphs are still well aligned because of the pose recognition. It can also happen that one object is represented by multiple graphs if it was not recognized. These scenarios are tracked with the mean_objects_per_graph and mean_graphs_per_object statistics.
An object is classified as detected correctly if the detected object ID is in the list of objects used for building the graph. This means that if a model was built from multiple objects, there are multiple correct classifications for this model. For example, if we learned a graph from a tomato can and later merge points from a peach can into the same graph, then this graph would be the correct label for tomato and peach cans in the future. This is also why the experiment with similar objects reaches a higher accuracy after the first epoch: since in the first epoch we build fewer graphs than the objects we saw (representing similar objects in the same model), it becomes easier to recognize these combined models later, because for this accuracy measure we no longer need to distinguish the similar objects if they are represented in the same graph. In the most extreme case, if all objects were merged into a single graph during the first epoch, the following epochs would reach 100% accuracy. As such, future work emphasizing unsupervised learning will also require more fine-grained metrics, such as a dataset with hierarchical labels that appropriately distinguish specific instances (peach can vs. tomato can) from general ones (cans, or even just "cylindrical objects").
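A minimal sketch of this correctness check and of the two graph statistics, assuming each learned graph is tracked together with the set of object labels that contributed points to it (an illustrative structure, not Monty's internal one):

```python
def detection_correct(detected_graph, target_object, graph_labels):
    """An episode counts as correct if the target is among the objects
    whose points were merged into the detected graph during learning."""
    return target_object in graph_labels[detected_graph]

def graph_stats(graph_labels):
    """Compute mean_objects_per_graph and mean_graphs_per_object."""
    objects_per_graph = [len(objs) for objs in graph_labels.values()]
    all_objects = set().union(*graph_labels.values())
    graphs_per_object = [
        sum(obj in objs for objs in graph_labels.values())
        for obj in all_objects
    ]
    return (
        sum(objects_per_graph) / len(objects_per_graph),
        sum(graphs_per_object) / len(graphs_per_object),
    )

# Example: tomato-can and peach-can points merged into a single graph.
graph_labels = {"graph_0": {"tomato_can", "peach_can"}, "graph_1": {"fork"}}
assert detection_correct("graph_0", "peach_can", graph_labels)
print(graph_stats(graph_labels))  # (1.5, 1.0)
```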
!table[../../benchmarks/results/ycb_unsupervised.csv]
To obtain these results, use print_unsupervised_stats(train_stats, epoch_len=10) (wandb logging is currently not implemented for unsupervised stats). Unsupervised, continual learning cannot, by definition, be parallelized across epochs. These experiments were therefore run without multiprocessing (using run.py) on a laptop; running on cloud CPUs works as well, but since they are slower without parallelization, the laptop was used.
Most benchmark experiments assume a clean separation between objects and a clearly defined episode structure, where each episode corresponds to a single object and resets allow Monty to reinitialize its internal state. However, in real-world settings, such boundaries don't exist. Objects may be swapped, occluded, or even combined (e.g., a logo on a mug), and an agent must continuously perceive and adapt without external signals indicating when or whether an object has changed. This capability is essential for scaling to dynamic, real-world environments where compositionality, occlusion, and object transitions are the norm rather than the exception.
To simulate such a scenario, we designed an experimental setup that swaps the current object without resetting Monty's internal state. The goal is to test whether Monty can correctly abandon the old hypothesis and begin accumulating evidence on the new object, all without any explicit supervisory signal or internal reset. Unlike typical episodes, where Monty's internal state (including its learning modules, sensor modules, buffers, and hypothesis space) is reinitialized at object boundaries, here the model must adapt dynamically based solely on its stream of observations and internal evidence updates.
More specifically, these experiments are run purely in evaluation mode (i.e., pre-trained object graphs are loaded before the experiment begins), with no training or graph updates taking place. Monty stays in the matching phase, continuously updating its internal hypotheses based on sensory observations. For each object, the model performs a fixed number of matching steps before the object is swapped. At the end of each segment, we evaluate whether Monty's most likely hypothesis correctly identifies the current object. All experiments are performed on 10 distinct objects from the YCB dataset with 10 random rotations for each object. Random noise is added to the sensory observations.
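In outline, the evaluation loop looks something like the sketch below; the class and method names are hypothetical stand-ins for the actual experiment code:

```python
def run_swap_experiment(monty, env, objects, steps_per_object):
    """Unsupervised inference with object swaps and no internal resets."""
    n_correct = 0
    for obj in objects:
        env.swap_object(obj)  # swap the object; Monty's state is untouched
        for _ in range(steps_per_object):
            observation = env.step(monty.propose_action())
            monty.matching_step(observation)  # evidence updates only, no learning
        # Score each segment by the most likely hypothesis at its end.
        n_correct += monty.most_likely_hypothesis() == obj
    return n_correct / len(objects)
```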
!table[../../benchmarks/results/ycb_unsupervised_inference.csv]
Warning
These benchmark experiments track the progress on RFC 9: Hypotheses resampling.
We do not expect these experiments to have good performance until the RFC is implemented and issue #214 is resolved.
These experiments are currently run without multiprocessing (using run.py).
The following experiments evaluate Monty's ability to learn and infer compositional objects, where these consist of simple 3D objects (a disk, a cube, a cylinder, a sphere, and a mug) with 2D logos on their surface. The logos are either the TBP logo or the Numenta logo. In the dataset, the logos can be in a standard orientation on the object, or oriented vertically. Finally, there is an instance of the mug with the TBP logo bent half-way along the logo at 45 degrees.
We want to determine the ability of a Monty system with a hierarchy of LMs (here, a single low-level LM sending input to a single high-level LM) to build compositional models of these kinds of objects. To enable learning such models, we provide some amount of supervision to the LMs. The low- and high-level LMs begin by learning the 3D objects and logos in isolation, as standalone objects. These are referred to as object "parts" in the configs. We then present Monty with the compositional objects, while the low-level LM is set to perform unsupervised inference; it sends any object IDs it detects to the high-level LM. The high-level LM continues learning and is provided with a supervised label for the compositional object (e.g., 024_mug_tbp_horz).
To measure performance, we introduced two new metrics (sketched in code after this list):
- consistent_child_obj, which measures whether a learning module detects an object within the set of plausible child objects. For example, the consistent child objects for mug_tbp_horz would be mug and tbp_logo. We use this metric since the lower-level LM doesn't have the compositional model and we have no way (e.g., a semantic sensor) of knowing which part it was sensing.
- mlh_prediction_error, which measures how closely the prediction of the most likely hypothesis matches the current input.
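As a rough illustration of what these metrics compute (the data structures below are simplified assumptions, not Monty's actual implementation):

```python
import numpy as np

def consistent_child_obj(detected_id, compositional_obj, children):
    """children maps a compositional object to its plausible parts,
    e.g. {"mug_tbp_horz": {"mug", "tbp_logo"}} (illustrative)."""
    return detected_id in children[compositional_obj]

def mlh_prediction_error(predicted_features, observed_features):
    """Distance between the MLH's predicted input and the actual input."""
    diff = np.asarray(predicted_features) - np.asarray(observed_features)
    return float(np.linalg.norm(diff))
```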
!table[../../benchmarks/results/logos_on_objects.csv]
Warning
These benchmarks are not currently expected to have good performance and are used to track our research progress for compositional datasets.
Note: To obtain these results, pretraining was run without parallelization across episodes, while inference was run with parallelization.
Note
You can download the data here:
| Dataset | Archive Format | Download Link |
|---|---|---|
| compositional_objects | tgz | compositional_objects.tgz |
| compositional_objects | zip | compositional_objects.zip |
Unpack the archive in the ~/tbp/data/ folder. For example, for the tgz archive:
mkdir -p ~/tbp/data/
cd ~/tbp/data/
curl -L https://tbp-data-public-5e789bd48e75350c.s3.us-east-2.amazonaws.com/tbp.monty/compositional_objects.tgz | tar -xzf -
Or, for the zip archive:
mkdir -p ~/tbp/data/
cd ~/tbp/data/
curl -O https://tbp-data-public-5e789bd48e75350c.s3.us-east-2.amazonaws.com/tbp.monty/compositional_objects.zip
unzip compositional_objects.zip
To generate the pretrained models, run the experiments in benchmarks/configs/learn_compositional_objects.py in the order in which they are listed by running:
python benchmarks/run.py -e experiment_name
The following experiments evaluate a Monty model on real-world images taken with the RGBD camera of an iPad/iPhone. The models that the Monty system leverages are based on photogrammetry scans of the same objects in the real world, and Monty learns on these in the simulated Habitat environment. We take this approach because we currently cannot track the movements of the iPad through space, so Monty cannot use its typical sensorimotor learning to build the internal models.
For a really cool video of the first time Monty was tested in the real world, see the recording linked on our project showcase page.
These experiments are designed to evaluate Monty's robustness to real-world data and, in this particular case, its ability to generalize from simulation to the real world. In the world_image experiments, the model is evaluated on the aforementioned iPad-extracted images, while in the randrot_noise_sim_on_scan_monty_world experiment, we evaluate the model in simulation at inference time, albeit with some noise added and with the distant agent fixed to a single location (i.e., no hypothesis-testing policy). This enables a reasonable evaluation of the sim-to-real change in performance. Furthermore, the world_image experiments are intended to capture a variety of possible adversarial settings.
The dataset itself consists of 12 objects, with some representing multiple instances of similar objects (e.g., the Numenta mug vs. the terracotta mug, or the hot sauce bottle vs. the cocktail bitters bottle). Each of the world_image datasets contains 4 different views of each object, for a total of 48 views per dataset, or 240 views across all 5 real-world settings. The experimental conditions are i) standard (no adversarial modifications), ii) dark (low lighting), iii) bright, iv) hand intrusion (a hand significantly encircles and thereby occludes parts of the object), and v) multi-object (the first two of the four images pair the object with a similar object next to it, and the latter two pair it with a structurally different object).
You can download the data:
| Dataset | Archive Format | Download Link |
|---|---|---|
| worldimages | tgz | worldimages.tgz |
| worldimages | zip | worldimages.zip |
Unpack the archive in the ~/tbp/data/ folder. For example, for the tgz archive:
mkdir -p ~/tbp/data/
cd ~/tbp/data/
curl -L https://tbp-data-public-5e789bd48e75350c.s3.us-east-2.amazonaws.com/tbp.monty/worldimages.tgz | tar -xzf -
Or, for the zip archive:
mkdir -p ~/tbp/data/
cd ~/tbp/data/
curl -O https://tbp-data-public-5e789bd48e75350c.s3.us-east-2.amazonaws.com/tbp.monty/worldimages.zip
unzip worldimages.zip
Finally, note that the world_image experimental runs do not support multiprocessing, so you cannot use the run_parallel.py script when running them. This is because an appropriate object_init_sampler has yet to be defined for this experimental setup. All experiments are run with 16 CPUs for benchmarking purposes.
See the monty_lab project folder for the code.
Note
The randrot_noise_sim_on_scan_monty_world experiment requires HabitatSim to be installed as well as additional data containing the meshes for the simulator to use.
You can download the data:
| Dataset | Archive Format | Download Link |
|---|---|---|
| numenta_lab | tgz | numenta_lab.tgz |
| numenta_lab | zip | numenta_lab.zip |
Unpack the archive in the ~/tbp/data/ folder. For example, for the tgz archive:
mkdir -p ~/tbp/data/
cd ~/tbp/data/
curl -L https://tbp-data-public-5e789bd48e75350c.s3.us-east-2.amazonaws.com/tbp.monty/numenta_lab.tgz | tar -xzf -
Or, for the zip archive:
mkdir -p ~/tbp/data/
cd ~/tbp/data/
curl -O https://tbp-data-public-5e789bd48e75350c.s3.us-east-2.amazonaws.com/tbp.monty/numenta_lab.zip
unzip numenta_lab.zip
!table[../../benchmarks/results/montymeetsworld.csv]
Note that rotation errors are meaningless here, since no ground-truth rotation is provided.
- Why is there such a drop in performance going from sim-to-real?
Although there are likely a variety of factors at play, it is worth emphasizing that it is highly non-trivial to learn on photogrammetry-scanned objects and then generalize to images extracted from an RGBD camera, as this is a significant shift in the source of data available to the model. Furthermore, given the structural similarity of several of the objects in the dataset to one another, it is not too surprising that the model may converge to an incorrect hypothesis given only a single view.
- Why is the percent MLH used so high?
Each episode is restricted to a single viewing angle of an object, resulting in significant ambiguity. Furthermore, episodes use a maximum of 100 matching steps so that these experiments can be run quickly.
- Are there any other factors contributing to performance differences to be aware of?
During the collection of some of the datasets, the "smoothing" setting was unfortunately not active; this affects the standard (world_image_on_scanned_model) dataset, as well as the bright and hand-intrusion experiments. Broadly, this appears not to have had much of an impact, given that, e.g., dark and bright perform comparably (with bright actually being better, even though it was acquired without smoothing). There are a couple of images (around 5 of the 240) where this resulted in a large step change in the depth reading, so the experiment begins with the model "off" the object, even though to a human eye the initial position of the patch is clearly on the object. This will be addressed in a future update to the datasets, where we can also implement any additional changes we may wish to make during data collection (e.g., more control of object poses, or the inclusion of motor data).
- What steps should be noted when acquiring new images?
In addition to ensuring that the "smoothing" option is toggled on (currently off by default), lay the iPad on its side with the volume buttons at the top, so that the orientation of images is consistent across the datasets. In general, objects should be as close to the camera as possible when taking images, while ensuring the depth values do not begin to clip.
In the future, we will expand this test suite to cover more capabilities, such as more multi-object scenarios (touching vs. occluding objects), compositional objects, object categories, distorted objects, different features on the same morphology, objects in different states, object behaviors, and abstract concepts.
We aim to build a general-purpose system that is not optimized for one specific application. Think of it as similar to Artificial Neural Networks (ANNs), which can be applied to all kinds of different problems and are a general tool for modeling data. Our system will be aimed at modeling sensorimotor data, not static datasets. This means the input to the system should contain sensor and motor information. The system then models this data and can output motor commands.
The most natural application is robotics, with physical sensors and actuators. However, the system should also be able to generalize to more abstract sensorimotor setups, such as navigating the web or conceptual space. As another example, reading and producing language can be framed as a sensorimotor task where the sensor moves through the sentence space, and action outputs could produce the letters of the alphabet in a meaningful sequence. Due to the messaging protocol between the sensor and learning module, the system can effortlessly integrate multiple modalities and even ground language in physical models learned through other senses like vision or touch.
For more details, see Application Criteria and Capabilities of the System.
Much of today's AI is based on learning from giant datasets by training ANNs on vast clusters of GPUs. This not only burns a lot of energy and requires a large dataset, but it is also fundamentally different from how humans learn. We know that there is a more efficient way of learning; we do it every day with our brain, which uses about as much energy as a light bulb. So why not use what we know about the brain to make AI more efficient and robust?
This project is an ambitious endeavor to rethink AI from the ground up. We know there is a lot of hype around LLMs, and we believe that they will remain useful tools in the future, but they are not as efficient nor as capable as the neocortex. In the Thousand Brains Project, we want to build an open-source platform that will catalyze a new type of AI. This AI learns continuously and efficiently through active interaction with the world, just like children do.
We aim to make our conceptual progress available quickly by publishing recordings of all our research meetings on YouTube. Any engineering progress will automatically be available as part of our open-source code base. We keep track of our system's capabilities by frequently running a suite of benchmark experiments and evaluating the effectiveness of any new features we introduce. The results of these will also be visible in our GitHub repository.
In addition to making all incremental progress visible, we will publish more succinct write-ups of our progress and results at academic conferences and in peer-reviewed journals. We also plan to produce more condensed informational content through a podcast and YouTube videos.
No, LLMs are incredibly useful and powerful for various applications. However, we believe that the current approach most researchers and companies employ of incrementally adding small features to ANNs/LLMs will lead to diminishing returns over the next few years. Developing genuine human-like intelligence demands a bold, innovative departure from the norm. As we rethink AI from the ground up, we anticipate a longer period of initial investment with little return that will eventually compound to unlock potential that is unreachable with today’s solutions. We believe that this more human-like artificial intelligence will be what people think of when asked about AI in the future. At the same time, we see LLMs as a tool that will continue to be useful for specific problems, much like the calculator is today.
The system we are building in the Thousand Brains Project has many advantages over current popular approaches. For one, it is much more energy and data-efficient. It can learn faster and from less data than deep learning approaches. This means that it can learn from higher-quality data and be deployed in applications where data is scarce. It can also continually add new knowledge to its models without forgetting old knowledge. The system is always learning, actively testing hypotheses, and improving its current models of whatever environment it is learning in.
Another advantage is the system's scalability and modularity. Due to the modular and general structure of the learning and sensor modules, one can build extremely flexible architectures tailored to an application's specific needs. A small application may only require a single learning module to model it, while a large and complex application could use thousands of learning modules and even stack them hierarchically. The cortical messaging protocol makes multimodal integration possible and effortless.
Using reference frames for modeling allows for easier generalization, more robust representations, and more interpretability. In sum, the system is designed to be good at the things humans are good at but current AI is not.
The TBP and HTM are both based on years of neuroscience research at Numenta and other labs across the world. They both implement principles we learned from the neocortex in code to build intelligent machines. However, they are entirely separate implementations and differ in which principles they focus on. While HTM focuses more on the lower-level computational principles such as sparse distributed representations (SDR), biologically plausible learning rules, and sequence memory, the TBP focuses more on the higher-level principles such as sensorimotor learning, the cortical column as a general and repeatable modeling unit, and models structured by reference frames.
In the TBP, we are building a general framework based on the principles of the thousand brains theory. We have sensor modules that convert raw sensor data into a common messaging protocol and learning modules, which are general, sensorimotor modeling units that can get input from any sensor module or learning module. Importantly, all communication within the system involves movement information, and models learned within the LMs incorporate this motion information into their reference frames.
There can be many types of learning modules as long as they adhere to the messaging protocol and can model sensorimotor data. This means there could be a learning module that uses HTM (with some mechanism to handle the movement data, such as grid cells). However, the learning modules do not need to use HTM, and we usually don't use HTM-based modules in our current implementation.
We are excited that you’re interested in this project, and we want to build an active open-source community around it. There are different ways you can get involved. If you are an engineer or researcher with ideas on improving our implementation, we would be delighted to have you contribute to our code base or documentation. Check out details on ways to contribute here.
Second, if you have a specific sensorimotor task you are trying to solve, we would love for you to try our approach. We will work on making an easy-to-use SDK so you can just plug in your sensors and actuators, and our system does the modeling for you. If you would like to see some examples of how other people used our code in their projects, check out our project showcase.
We will start hosting regular research and industry workshops and would be happy to have you join.
Follow our meetup group for updates on upcoming events.
We are also planning to host a series of invited speakers again, so please let us know if you have research that you would like to present and discuss with us. Also, if you have ideas for potential collaborations, feel free to reach out to us at [email protected].
The Thousand Brains Project and our research continue to be funded by Jeff Hawkins, and now also in part by the Gates Foundation. Our funding is focused on fundamental research into this new technology, but will also facilitate exchanges with related research groups and potential applications.
From our inception, Numenta has had two goals: first, to understand the human brain and how it creates intelligence, and second, to apply these principles to enable true machine intelligence. The Thousand Brains Project is aligned with both of these goals and adds a third goal to make the technology accessible and widely adopted.
Numenta has developed a framework for understanding what the neocortex does and how it does it, called the Thousand Brains Theory of Intelligence. The Thousand Brains Project is a collaborative, open-source framework dedicated to creating a new type of artificial intelligence that pushes the current boundaries of AI. Numenta's goals for the Thousand Brains Project are to build an open-source platform for intelligent sensorimotor systems and to be a catalyst for a whole new way of thinking about machine intelligence.
title: Further Reading
description: Here we put a list of books and papers that might be interesting for you if you are interested in this project.
(alphabetically)
[block:image] { "images": [ { "image": [ "https://files.readme.io/3b156ef3ddf318743205f47472e4c91420c0083ba2b6db75ab9eb7285b477361-image.png", null, "" ], "align": "left", "sizing": "150px" } ] } [/block]
A Thousand Brains: A New Theory of Intelligence
by Jeff Hawkins
[block:image] { "images": [ { "image": [ "https://files.readme.io/2a073c43091330ce77502fe2fc891c5c6ef97fdf376d04d2eed27247295ad706-image.png", null, "" ], "align": "right", "sizing": "150px" } ] } [/block]
A Brief History of Intelligence: Evolution, AI, and the Five Breakthroughs that Made our Brains
by Max Bennett
[block:image] { "images": [ { "image": [ "https://files.readme.io/6eb0965e920305ad5ef5b0f1a359cc1687008e5809468aed3b4d47a54d64c876-image.png", null, "" ], "align": "left", "sizing": "150px" } ] } [/block]
Dark and Magical Places: The Neuroscience of Navigation
by Christopher Kemp
[block:image] { "images": [ { "image": [ "https://files.readme.io/79cbea8f20a7a9945d87df91a9797bdbfb20c87c4b9563159b19c684f012c50e-image.png", null, "" ], "align": "right", "sizing": "150px" } ] } [/block]
Exploring the Thalamus and Its Role in Cortical Function
by S. Murray Sherman and R. W. Guillery
[block:image] { "images": [ { "image": [ "https://files.readme.io/cf85399b11ecef528e8d3f1f98dcf27ff12d9aef7ca28df488b42f75f394714f-image.png", null, "" ], "align": "left", "sizing": "150px" } ] } [/block]
On Intelligence: How a New Understanding of the Brain Will Lead to the Creation of Truly Intelligent Machines
by Jeff Hawkins and Sandra Blakeslee
[block:image] { "images": [ { "image": [ "https://files.readme.io/c001bf6d4ba8b83c1b512b247b864e8735d2ba24ca565295eb829572305d6c0e-image.png", null, "" ], "align": "right", "sizing": "150px" } ] } [/block]
Perceptual Neuroscience: The Cerebral Cortex
by Vernon B. Mountcastle
[block:image] { "images": [ { "image": [ "https://files.readme.io/9f2734f88c1e6148cf81f8ce416986f9f9ffc8769cc929fd6196e60dab6bd53f-image.png", null, "" ], "align": "left", "sizing": "150px" } ] } [/block]
Toward Human-Level Artificial Intelligence: How Neuroscience Can Inform the Pursuit of Artificial General Intelligence or General AI
by Eitan M. Azoff
(by year)
This is an extremely stripped-down list of the hundreds of papers on which our theory and ideas are based. We tried to collect a few key review articles summarizing important findings that we regularly come back to in our research meetings.
Felleman, D. J., & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1(1), 1–47. https://doi.org/10.1093/cercor/1.1.1
Mountcastle, V. (1997). The columnar organization of the neocortex. Brain, 120(4), 701–722. https://doi.org/10.1093/brain/120.4.701
Thomson, A. (2003). Interlaminar connections in the neocortex. Cerebral Cortex, 13(1), 5–14. https://doi.org/10.1093/cercor/13.1.5
Markram, H., Toledo-Rodriguez, M., Wang, Y., Gupta, A., Silberberg, G., & Wu, C. (2004). Interneurons of the neocortical inhibitory system. Nature Reviews Neuroscience, 5(10), 793–807. https://doi.org/10.1038/nrn1519
Sherman, S. M. (2005). Thalamic relays and cortical functioning. In Progress in Brain Research (Vol. 149, pp. 107–126). Elsevier. https://doi.org/10.1016/S0079-6123(05)49009-3
Hegdé, J., & Felleman, D. J. (2007). Reappraising the functional implications of the primate visual anatomical hierarchy. The Neuroscientist, 13(5), 416–421. https://doi.org/10.1177/1073858407305201
Thomson, A. M. (2007). Functional maps of neocortical local circuitry. Frontiers in Neuroscience, 1(1), 19–42. https://doi.org/10.3389/neuro.01.1.1.002.2007
Thomson, A. (2010). Neocortical layer 6, a review. Frontiers in Neuroanatomy, 4, Article 13. https://doi.org/10.3389/fnana.2010.00013
Sherman, S. M., & Guillery, R. W. (2011). Distinct functions for direct and transthalamic corticocortical connections. Journal of Neurophysiology, 106(3), 1068–1077. https://doi.org/10.1152/jn.00429.2011
Petersen, C. C. H., & Crochet, S. (2013). Synaptic computation and sensory processing in neocortical layer 2/3. Neuron, 78(1), 28–48. https://doi.org/10.1016/j.neuron.2013.03.020
Gu, Y., Lewallen, S., Kinkhabwala, A. A., Domnisoru, C., Yoon, K., Gauthier, J. L., Fiete, I. R., & Tank, D. W. (2018). A map-like micro-organization of grid cells in the medial entorhinal cortex. Cell, 175(3), 736–750.e30. https://doi.org/10.1016/j.cell.2018.08.066
Usrey, W. M., & Sherman, S. M. (2019). Corticofugal circuits: Communication lines from the cortex to the rest of the brain. Journal of Comparative Neurology, 527(3), 640–650. https://doi.org/10.1002/cne.24423
Whittington, J. C. R., Muller, T. H., Mark, S., Chen, G., Barry, C., Burgess, N., & Behrens, T. E. J. (2020). The Tolman-Eichenbaum machine: Unifying space and relational memory through generalization in the hippocampal formation. Cell, 183(5), 1249–1263. https://doi.org/10.1016/j.cell.2020.10.024
Rao, R. P. N. (2022). A sensory-motor theory of the neocortex based on active predictive coding. bioRxiv. https://doi.org/10.1101/2022.12.30.522267
Suzuki, M., Pennartz, C. M. A., & Aru, J. (2023). How deep is the brain? The shallow brain hypothesis. Nature Reviews Neuroscience. https://doi.org/10.1038/s41583-023-00756-z
Hawkins, J., & Ahmad, S. (2016). Why neurons have thousands of synapses: A theory of sequence memory in neocortex. Frontiers in Neural Circuits, 10, Article 23. https://doi.org/10.3389/fncir.2016.00023
Hawkins, J., Ahmad, S., & Cui, Y. (2017). A theory of how columns in the neocortex enable learning the structure of the world. Frontiers in Neural Circuits, 11, Article 81. https://doi.org/10.3389/fncir.2017.00081
Ahmad, S., & Scheinkman, L. (2019). How can we be so dense? The benefits of using highly sparse representations. arXiv. https://arxiv.org/abs/1903.11257
Hawkins, J., Lewis, M., Klukas, M., Purdy, S., & Ahmad, S. (2019). A framework for intelligence and cortical function based on grid cells in the neocortex. Frontiers in Neural Circuits, 12, Article 121. https://doi.org/10.3389/fncir.2018.00121
Hole, K. J., & Ahmad, S. (2021). A thousand brains: Toward biologically constrained AI. SN Applied Sciences, 3(8), 743. https://doi.org/10.1007/s42452-021-04715-0
Clay, V., Leadholm, N., & Hawkins, J. (2024). The Thousand Brains Project: A New Paradigm for Sensorimotor Intelligence. arXiv. https://arxiv.org/abs/2412.18354
Leadholm, N., Clay, V., Knudstrup, S., Lee, H., & Hawkins, J. (2025). Thousand-Brains Systems: Sensorimotor Intelligence for Rapid, Robust Learning and Inference. arXiv. https://arxiv.org/abs/2507.04494
Note
You can read the Thousand-Brains Systems Plain Language Explainer for a less technical overview of the concepts in the above paper.
Hawkins, J., Leadholm, N., & Clay, V. (2025). Hierarchy or Heterarchy? A Theory of Long-Range Connections for the Sensorimotor Brain. arXiv. https://arxiv.org/abs/2507.05888
Note
You can read the Hierarchy or Heterarchy Plain Language Explainer for a less technical overview of the concepts in the above paper.
As we work on the neuroscience theory, we publish recordings of those meetings on our YouTube channel. Those videos contain our most up-to-date thinking and the questions we are grappling with on a day-to-day basis. If you are curious about the neuroscience behind this project, have a look at our brainstorming and review video series. For a short introduction to the project, see our quick-start series, and for a longer overview of key aspects of the project, our core video series.
Below is a great introductory video of Jeff Hawkins presenting an overview of our neuroscience theory. [block:embed] { "html": "<iframe class="embedly-embed" src="//cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FowRC8sLSb64%3Ffeature%3Doembed&display_name=YouTube&url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DowRC8sLSb64&image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FowRC8sLSb64%2Fhqdefault.jpg&type=text%2Fhtml&schema=youtube" width="854" height="480" scrolling="no" title="YouTube embed" frameborder="0" allow="autoplay; fullscreen; encrypted-media; picture-in-picture;" allowfullscreen="true"></iframe>", "url": "https://www.youtube.com/watch?v=owRC8sLSb64", "title": "2025/02 - Cortical Circuit Overview", "favicon": "https://www.youtube.com/favicon.ico", "image": "https://i.ytimg.com/vi/owRC8sLSb64/hqdefault.jpg", "provider": "https://www.youtube.com/", "href": "https://www.youtube.com/watch?v=owRC8sLSb64", "typeOfEmbed": "youtube" } [/block]
title: Glossary
description: This section aims to provide concise definitions of terms commonly used at the Thousand Brains Project and in Monty.
Dendrites implement pattern recognizers that identify patterns such as a specific SDR. One neuron is typically associated with multiple dendrites, so it can identify multiple patterns. In biology, the dendrites of a postsynaptic cell receive information from the axons of other, presynaptic cells. The axons of these presynaptic cells connect to the dendrites of postsynaptic cells at junctions called "synapses". An SDR can be thought of as a pattern represented by a set of synapses that are collocated on a single dendritic segment.
The spatial difference between two locations. In 3D space, this would be a 3D vector.
A copy of the motor command that was output by the policy and sent to the actuators. This copy can be used by learning modules to update their states or make predictions.
Depending on the environment's state and the agent's actions and sensors, the environment returns an observation for each sensor.
Characteristics that can be sensed at a specific location. Features may vary depending on the sensory modality (for example, color in vision but not in touch).
A set of nodes that are connected to each other with edges. Both nodes and edges can have features associated with them. For instance, all graphs used in the Monty project have a location associated with each node and a variable list of features. An edge can, for example, have a displacement associated with it.
An assumption that is built into an algorithm/model. If the assumption holds, this can make the model a lot more efficient than without the inductive bias. However, it will cause problems when the assumption does not hold.
A computational unit that takes features at poses as input and uses this information to learn models of the world. It is also able to recognize objects and their poses from the input if an object has been learned already.
In Monty, a model (sometimes referred to as Object Model), is a representation of an object stored entirely within the boundaries of a learning module. The notion of a model in Monty differs from the concept of a deep learning neural network model in several ways:
- A single learning module stores multiple object models in memory, simultaneously.
- The Monty system may have multiple models of the same object if there are multiple learning modules; this is desired behavior.
- Learning modules update models independently of each other.
- Models are structured using reference frames, not just a bag of features.
- Models represent complete objects, not just parts of objects. These objects can still become subcomponents of compositional objects but are also objects themselves (like the light bulb in a lamp).
A useful analogy is to think of Monty models as CAD representations of objects that exist within the confines of a learning module.
Also see Do Cortical Columns in the Brain Really Model Whole Objects Like a Coffee Mug in V1?
Updating an agent's location by using its own movement and features in the environment.
Defines the function used to select actions. Selected actions can be dependent on a model's internal state and on external inputs.
An object's location and orientation (in a given reference frame). The location can for example be x, y, z coordinates and the orientation can be represented as a quaternion, Euler angles, or a rotation matrix.
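For example, the equivalence of these orientation representations can be illustrated with scipy (purely for illustration; Monty's internals may use different conventions):

```python
from scipy.spatial.transform import Rotation as R

rot = R.from_euler("xyz", [0, 90, 0], degrees=True)  # Euler angles in degrees
quaternion = rot.as_quat()         # the same orientation as [x, y, z, w]
rotation_matrix = rot.as_matrix()  # ...or as a 3x3 rotation matrix
```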
A specific coordinate system within which locations and rotations can be represented. For instance, a location may be represented relative to the body (body/ego-centric reference frame) or relative to some point in the world (world/allo-centric reference frame) or relative to an object's center (object-centric reference frame). There is no requirement for a specific origin (for example grid cells in the brain don't represent an origin). The important thing is that locations are represented relative to each other in a consistent metric space with path integration properties. For more information, see our documentation on reference frames in Monty (and transforms between them).
Applies a displacement/translation and a rotation to a set of points. Every point is transformed in the same way such that the overall shape stays the same (i.e., the relative distance between points is fixed).
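In symbols, with a rotation matrix R and a translation vector t, every point p_i maps to p_i' = R p_i + t; because all points are transformed identically, all pairwise distances ||p_i' - p_j'|| = ||p_i - p_j|| are preserved.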
A computational unit that turns raw sensory input into the cortical messaging protocol. The structure of the output of a sensor module is independent of the sensory modality and represents a list of features at a pose.
Learning or inference through interaction with an environment using a closed loop between action and perception. This means observations depend on actions, and in turn the choice of these actions depends on the observations.
A binary vector with significantly more 0 bits than 1 bits. Significant overlap between the bit assignments in different SDRs captures similarity in representational space (e.g., similar features).
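For instance, the overlap between two SDRs can be computed as a simple bit count (a minimal numpy illustration, not Monty's or HTM's SDR machinery):

```python
import numpy as np

def sdr_overlap(a, b):
    """Number of shared 1 bits between two binary vectors."""
    return int(np.count_nonzero(np.logical_and(a, b)))

a = np.zeros(2048, dtype=bool)
a[[3, 40, 77, 512]] = True
b = np.zeros(2048, dtype=bool)
b[[3, 40, 900, 1500]] = True
print(sdr_overlap(a, b))  # 2 shared bits -> some representational similarity
```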
Applies a displacement/translation and a rotation to a point.
Multiple computational units share information about their current state with each other. This can for instance be their current estimate of an object's ID or pose. This information is then used to update each unit's internal state until all units reach a consensus.
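A highly simplified sketch of one voting round, assuming each learning module's state is just a dict of evidence scores per object ID (Monty's actual voting also exchanges pose hypotheses):

```python
def voting_round(evidences):
    """Each LM adds the mean of the other LMs' evidence to its own.

    evidences: list of dicts mapping object IDs to evidence scores,
    one dict per learning module (illustrative structure).
    """
    deltas = []
    for i, lm in enumerate(evidences):
        others = [e for j, e in enumerate(evidences) if j != i]
        deltas.append({
            obj: sum(o.get(obj, 0.0) for o in others) / max(len(others), 1)
            for obj in lm
        })
    # Apply all updates at once so the round is order-independent.
    for lm, delta in zip(evidences, deltas):
        for obj, d in delta.items():
            lm[obj] += d
    return evidences
```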
We are developing a platform for building AI and robotics applications using the same principles as the human brain. These principles are fundamentally different from those used in deep learning, which is currently the most prevalent form of AI. Therefore, our platform represents an alternate form of AI, one that we believe will play an ever-increasing role in the future.
We call the implementation described herein "Monty", in reference to Vernon Mountcastle, who proposed the columnar organization of the mammalian neocortex. Mountcastle argued that the power of the mammalian brain lies in its re-use of columns as a core computational unit, and this paradigm represents a central component of the Thousand Brains Project (TBP). The Monty implementation is one specific instantiation of the Thousand Brains Theory and the computations of the mammalian neocortex and is implemented in Python. In the future, there may be other implementations as part of this project. The ultimate aim is to enable developers to build AI applications that are more intelligent, more flexible, and more capable than those built using traditional deep learning methods. Monty is our first step towards this goal and represents an open-source, sensorimotor learning framework.
One key differentiator between the TBP and other AI technologies is that the TBP is built with embodied, sensorimotor learning at its core. Sensorimotor systems learn by sensing different parts of the world over time while interacting with it. For example, as you move your body, your limbs, and your eyes, the input to your brain changes. In Monty, the learning derived from continuous interaction with an environment represents the foundational knowledge that supports all other functions. This contrasts with the increasingly common view that sensorimotor interaction is a sub-problem that can be solved by starting from an architecture trained on a mixture of internet-scale language and multimedia data. In addition to sensorimotor interaction being the core basis for learning, the centrality of sensorimotor learning manifests in the design choice that all levels of processing are sensorimotor. As will become clear, sensory and motor processing are not broken up and handled by distinct architectures, but play a crucial role at every point in Monty where information is processed.
A second differentiator is that our sensorimotor systems learn structured models using reference frames: coordinate systems within which locations and rotations can be represented. The models keep track of where their sensors are relative to things in the world. They are learned by assigning sensory observations to locations in reference frames. In this way, the models learned by sensorimotor systems are structured, similar to CAD models in a computer. This allows the system to quickly learn the structure of the world and how to manipulate objects to achieve a variety of goals, what is sometimes referred to as a 'world model'. As with sensorimotor learning, reference frames are used throughout all levels of information processing, including the representations of not only environments but also physical objects and abstract concepts; even the simplest representations in the proposed architecture are expressed within a reference frame.
There are numerous advantages to sensorimotor learning and reference frames. At a high level, you can think about all the ways humans are different from today's AI. We learn quickly and continuously, constantly updating our knowledge of the world as we go about our day. We do not have to undergo a lengthy and expensive training phase to learn something new. We interact with the world and manipulate tools and objects in sophisticated ways that leverage our knowledge of how things are structured. For example, we can explore a new app on our phone and quickly figure out what it does and how it works based on other apps we know. We actively test hypotheses to fill in the gaps in our knowledge. We also learn from multiple sensors and our different sensors work together seamlessly. For example, we may learn what a new tool looks like with a few glances and then immediately know how to grab and interact with the object via touch.
One of the most important discoveries about the brain is that most of what we think of as intelligence, from seeing, to touching, to hearing, to conceptual thinking, to language, is created by a common neural algorithm. All aspects of intelligence are created by the same sensorimotor mechanism. In the neocortex, this mechanism is implemented in each of the thousands of cortical columns. This means we can create many different types of intelligent systems using a set of common building blocks. The architecture we are creating is built on this premise. Monty will provide the core components and developers will then be able to assemble widely varying AI and robotics applications using these components in different numbers and arrangements. Any engineer will be able to create AI applications using the Platform without requiring huge computational resources or background knowledge.
Welcome to the Thousand Brains Project, an open-source framework for sensorimotor learning systems that follow the same principles as the human brain.
This documentation outlines the current features, and future vision, of our platform for building next-generation AI and robotics applications using neocortical algorithms. In addition, we describe the details of Monty, the first implementation of a thousand-brains system. Named in honor of Vernon Mountcastle, who argued that the power of the mammalian brain lies in its re-use of cortical columns as the primary computational unit, Monty represents a fundamentally new way of building AI systems.
The Monty project incorporates a lot of new concepts and ideas and will require considerable learning on the part of contributors who want to make significant code contributions or to use the code to solve real-world problems. With that in mind, we've tried to make the project easy to comprehend and get started with. To understand the fundamental principles of the project, these are some resources that we recommend:
- 🧠 Vision of the Thousand Brains Project which describes the guiding principles of the project.
- 🎥 YouTube Videos that contain in-depth descriptions of the project and the principles that guide it. The Quick Start playlist is the fastest way to learn the basics.
- 📚 Tutorials which are step-by-step guides for using Monty.
- 💬 Discourse Forum which is a community forum for discussing the project and a great place for beginners to get answers to questions.
- ❓ You can also check out our FAQs for the Thousand Brains Project, and the underlying Monty algorithms.
The documentation is broken into six main sections:
| Section | Description |
|---|---|
| 🔍 Overview | Learn about the core principles and architecture behind the Thousand Brains Project. This section also talks about potential practical applications of the system and presents the abilities of our current implementation. |
| 🚀 How to Use Monty | Get started quickly with step-by-step guides for installing, configuring, and running your first Monty experiments. |
| ⚙️ How Monty Works | Dive deep into the concrete algorithms that make Monty work and understand how the different components function together. |
| 🤝 Contributing | Discover the many ways you can contribute to the project - from code and documentation to testing and ideas. |
| 👥 Community | Join our welcoming community! Learn about our guidelines, code of conduct, and how to participate effectively. |
| 🔮 Future Work | Explore exciting opportunities to help shape Monty's future by contributing to planned features and improvements. |
We are excited to have you here! Our intention for making the project open-source is to foster a community of researchers and developers interested in contributing to the project and to allow all of humanity to benefit from these advances in AI. The Thousand Brains Project team is quite small, and we have a limited amount of time, so please be patient with our responses and know that we'll do our best to get back to you as soon as possible. That said, here's a list of ways you can contribute to the project.
| Resource | Description |
|---|---|
| | Access our source code, contribute features, report issues, and collaborate with other developers |
| | Discuss ideas, ask questions, and connect with other community members |
| | Learn about our mission, team, and the science behind the project |
| | Watch tutorials, technical deep-dives, and project updates |
| | Get the latest news and announcements, and engage with our community |
| | Get the latest news and announcements |
If you're writing a publication that references the Thousand Brains Project or Monty, please cite our papers as appropriate:
Thousand Brains Project white paper:
@misc{thousandbrainsproject2024,
title={The Thousand Brains Project: A New Paradigm for Sensorimotor Intelligence},
author={Viviane Clay and Niels Leadholm and Jeff Hawkins},
year={2024},
eprint={2412.18354},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2412.18354},
}
Hierarchy or Heterarchy? A Theory of Long-Range Connections for the Sensorimotor Brain:
@misc{hawkins2025hierarchyheterarchytheorylongrange,
title={Hierarchy or Heterarchy? A Theory of Long-Range Connections for the Sensorimotor Brain},
author={Jeff Hawkins and Niels Leadholm and Viviane Clay},
year={2025},
eprint={2507.05888},
archivePrefix={arXiv},
primaryClass={q-bio.NC},
url={https://arxiv.org/abs/2507.05888},
}
Thousand-Brains Systems: Sensorimotor Intelligence for Rapid, Robust Learning and Inference:
@misc{leadholm2025thousandbrainssystemssensorimotorintelligence,
title={Thousand-Brains Systems: Sensorimotor Intelligence for Rapid, Robust Learning and Inference},
author={Niels Leadholm and Viviane Clay and Scott Knudstrup and Hojae Lee and Jeff Hawkins},
year={2025},
eprint={2507.04494},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2507.04494},
}
This section contains a list of articles, blogs and videos related to / about the Thousand Brains Project. Click the "Make a Contribution" button at the bottom of the page to open a PR to add your own content here.
You can also read articles on the newsroom page for higher level coverage of the Thousand Brains Project. https://thousandbrains.org/about/newsroom/
Active Learning Machines: What Thousand Brains Theory and Piaget Reveal About True Intelligence, By Greg Robinson https://gregrobison.medium.com/active-learning-machines-what-thousand-brains-theory-and-piaget-reveal-about-true-intelligence-304b5c9aa82e
Mapping Reality: From Ancient Navigation to AI's Spatial Innovation, By Greg Robinson https://gregrobison.medium.com/mapping-reality-from-ancient-navigation-to-ais-spatial-innovation-da1d2d2a8659
Learning to Forget: Why Catastrophic Memory Loss Is AI's Most Expensive Problem, By Greg Robinson https://gregrobison.medium.com/learning-to-forget-why-catastrophic-memory-loss-is-ais-most-expensive-problem-d764f5ee36b7
Hands-On Intelligence: Why the Future of AI Moves Like a Curious Toddler, Not a Supercomputer, By Greg Robinson https://gregrobison.medium.com/hands-on-intelligence-why-the-future-of-ai-moves-like-a-curious-toddler-not-a-supercomputer-8a48b67d0eb6
Sensorimotor Intelligence: The Thousand Brains Pathway to More Human-Like AI, By Greg Robinson https://gregrobison.medium.com/sensorimotor-intelligence-the-thousand-brains-pathway-to-more-human-like-ai-a4887320100a
Neural Anarchy: What Happens When Psychedelics Hack Your Brain's Democratic Process, By Greg Robinson https://gregrobison.medium.com/neural-anarchy-what-happens-when-psychedelics-hack-your-brains-democratic-process-4ce3060199c2
Cortical Comedy: How Your Brain's Voting System Creates the Experience of Funny, By Greg Robinson https://gregrobison.medium.com/cortical-comedy-how-your-brains-voting-system-creates-the-experience-of-funny-607aceac3296
Embodied Intelligence: How Neuroscience, Predictive Learning, and 3D Simulation Are Converging to Create AI That Acts in the World, By Greg Robinson https://gregrobison.medium.com/embodied-intelligence-how-neuroscience-predictive-learning-and-3d-simulation-are-converging-to-c266a62954f9
Cortical Columns, By Artem Kirsanov
[block:embed] { "html": "<iframe class="embedly-embed" src="//cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FDykkubb-Qus%3Ffeature%3Doembed&display_name=YouTube&url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DDykkubb-Qus&image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FDykkubb-Qus%2Fhqdefault.jpg&type=text%2Fhtml&schema=youtube" width="854" height="480" scrolling="no" title="YouTube embed" frameborder="0" allow="autoplay; fullscreen; encrypted-media; picture-in-picture;" allowfullscreen="true"></iframe>", "url": "https://www.youtube.com/watch?v=Dykkubb-Qus", "title": "2025/02 - Cortical Circuit Overview", "favicon": "https://www.youtube.com/favicon.ico", "image": "https://i.ytimg.com/vi/Dykkubb-Qus/hqdefault.jpg", "provider": "https://www.youtube.com/", "href": "https://www.youtube.com/watch?v=Dykkubb-Qus", "typeOfEmbed": "youtube" } [/block]
The system implemented in the Thousand Brains Project is designed to be a general-purpose AI system. It is not designed to solve a specific task or set of tasks. Instead, it is designed to be a platform that can be used to build a wide variety of AI applications. The design of an operating system or a programming language does not define what a user can apply it to. Similarly, the Thousand Brains Project will provide the tools necessary to solve many of today's current problems, as well as completely new and unanticipated ones, without being specific to any one of them.
Even though we cannot predict the ultimate use cases of the system, we want to test it on a variety of tasks and keep a set of capabilities in mind when designing the system. The basic principle here is that it should be able to solve any task the neocortex can solve. If we come up with a new mechanism that makes it fundamentally impossible to do something the neocortex can do, we need to rethink the mechanism.
Following is a list of capabilities that we are always thinking about when designing and implementing the system. We are not looking for point solutions for each of these problems but a general algorithm that can solve them all. It is by no means a comprehensive list but should give an idea of the scope of the system.
- Recognizing objects independent of their location and orientation in the world.
- Determining the location and orientation of an object relative to the observer, or to another object in the world.
- Performing learning and inference under noisy conditions.
- Learning from a small number of samples.
- Learning from continuous interaction with the environment with no explicit supervision, whilst maintaining previously learned representations.
- Recognizing objects when they are partially occluded by other objects.
- Learning categories of objects and generalizing to new instances of a category.
- Learning and recognizing compositional objects, including novel combinations of their parts.
- Recognizing objects subject to novel deformations (e.g., Dali's "melting clocks", a crumpled-up t-shirt, or objects learned in 3D but seen in 2D).
- Recognizing an object independent of its scale, and estimating its scale.
- Modeling and recognizing object states and behaviors (e.g., whether a stapler is open or closed; whether a person is walking or running, and how their body evolves over time under these conditions).
- Using learned models to alter the world and achieve goals, including goals that require decomposition into simpler tasks. The highest-level, overarching goals can be set externally.

The following capabilities will generalize from the same principles that the previous capabilities are built upon:

- Generalizing modeling to abstract concepts derived from concrete models.
- Modeling language and associating it with grounded models of the world.
- Modeling other entities ("Theory of Mind").
Finally, here is a video that walks through all of the system's current capabilities (including hard data) and our current thoughts and plans for future capabilities.
Video: "2025/07 - Thousand-Brains Systems: Sensorimotor Intelligence for Rapid Robust Learning and Inference" https://www.youtube.com/watch?v=3d4DmnODLnE
Several of the ideas and ways of thinking introduced in this document may be counter-intuitive to people used to the way of thinking prominent in current AI methods, including deep learning. For example, ideas about intelligent systems, learning, models, hierarchical processing, or action policies that you already have in mind might not apply to the system that we are describing. We therefore ask the reader to try and dispense with as many preconceptions as possible and to understand the ideas presented here on their own terms. We are happy to discuss any questions or thoughts that may arise from reading this document. Please reach out to us at [email protected].
Below, we highlight some of the most important differences between the system we are trying to build here and other AI systems.
- We are building a sensorimotor system. It learns by interacting with the world and sensing different parts of it over time. It does not learn from a static dataset. This is a fundamentally different way of learning than in most leading AI systems today, and it addresses a (partially overlapping) different set of problems.
- We will introduce learning modules as the basic, repeatable modeling unit, comparable to a cortical column. An important point here is that none of these modeling units receives the full sensory input. For example, in vision there is no 'full image' anywhere; each sensor senses a small patch of the world. This is in contrast to many AI systems today, where the whole input is fed into a single model.
- Despite the previous point, each modeling unit can learn complete models of objects and recognize them on its own. A single modeling unit should be able to perform all basic tasks of object recognition and manipulation. Using more modeling units makes the system faster and more efficient, and supports compositional and abstract representations, but a single learning module is itself a powerful system. In the single-module scenario, inference always requires movement to collect a series of observations, in the same way that recognizing a coffee cup with one of your fingers requires moving across its surface.
- All models are structured by reference frames. An object is not just a bag of features; it is a collection of features at locations. The relative locations of the features to each other are more important than the features themselves.
A central long-term goal is to build a universal Platform and messaging protocol for intelligent sensorimotor systems. We call this protocol the "Cortical Messaging Protocol" (CMP). The CMP can be used as an interface between different custom modules, and its universality is central to the ease of use of the SDK we are developing. For instance, one person may have modules optimized for flying drones using birds-eye observations, while another may be working with different sensors and actuators regulating a smart home. Those two are quite different modules but they should be able to communicate on the same channels defined here. Third parties could develop sensor modules and learning modules according to their specific requirements but they would be compatible with all existing modules due to a shared messaging protocol.
A second goal of the Thousand Brains Project (TBP) is to be a catalyst for a whole new way of thinking about machine intelligence. The principles of the TBP differ from many principles of popular AI methodologies today and are more in line with the principles of learning in the brain. Most concepts presented here derive from the Thousand Brains Theory (TBT) (Hawkins et al., 2019) and experimental evidence about how the brain works. Modules in Monty are inspired by cortical columns in the neocortex (Mountcastle, 1997). The CMP between modules relies on sparse location and reference frame-based data structures. They are analogous to long-range connections in the neocortex. In our implementation, we do not need to strictly adhere to all biological details and it is important to note that should an engineering solution serve us better for implementing certain aspects, then it is acceptable to deviate from the neuroscience and the TBT. In general, the inner workings of the modules can be relatively arbitrary and do not have to rely on neuroscience as long as they adhere to the CMP. However, the core principles of the TBP are motivated by what we have learned from studying the neocortex.
Third, this project aims to eventually bring together prior work into a single framework, including sparsity, active dendrites, sequence memory, and grid cells (Hawkins and Ahmad, 2016; Hawkins, Ahmad, and Cui, 2017; Ahmad and Scheinkman, 2019; Hawkins et al., 2019; Lewis et al., 2019).
Finally, it will be important to showcase the capabilities of our SDK. We will work towards creating a non-trivial demo where the implementation can be used to showcase capabilities that would be hard to demonstrate with any other type of AI system. This may not be one specific task but could play to the strength of this system to tackle a wide variety of tasks. We will also work on making Monty an easy-to-use open-source SDK that other practitioners can apply and test on their applications. We want this to be a platform for all kinds of sensorimotor applications and not just a specific technology showcase.
We have a set of guiding principles that steer the Thousand Brains Project. Throughout the life of the project there may be several different implementations and within each implementation there may be different versions of the core building blocks but everything we work on should follow these core principles:
- Sensorimotor learning and inference: We use actively generated temporal sequences of sensory inputs instead of static inputs.
- Modular structure: Easily expandable and scalable.
- Cortical Messaging Protocol: The inner workings of modules are highly customizable, but their inputs and outputs adhere to a defined protocol, such that many different sensor modules (and modalities) and learning modules can work together seamlessly.
- Voting: A mechanism by which a collection of experts can use different information and models to come to a faster, more robust, and more stable conclusion.
- Reference frames: The learned models should have inductive biases that make them naturally good at modeling a structured 4D world. The learned models can be used for a variety of tasks such as manipulation, planning, imagining previously unseen states of the world, fast learning, generalization, and many more.
- Rapid, continual learning where learning and inference are closely intertwined: Supported by sensorimotor embodiment and reference frames, biologically plausible learning mechanisms enable rapid knowledge accumulation and updates to stored representations while remaining robust in the continual learning setting. There is no clear distinction between learning and inference; we are always learning, and always performing inference.
Our goal in the near term is to continue building a sensorimotor learning framework based on the principles listed above with a general set of abilities for modeling and interacting with the world. We want to understand and flesh out some of the key issues and mechanisms of learning in such a modular, sensorimotor setup. Two key issues we will focus on next are learning compositional objects using hierarchy, and using learned object models to enable sophisticated ('model-based') action policies.
In the current stage of building up the Monty framework, we are focusing on the two basic components, learning modules and sensor modules, and the communication between them. In the initial implementation, many components are deliberately not biologically constrained and/or simplified, so as to support visualizing, debugging, and understanding the system as a whole. For example, object models are currently based on explicit graphs in 3D Cartesian space. In the future, these elements may be substituted with more powerful, albeit more inscrutable, neural components.
Another goal for the coming months is to open-source and communicate our progress and achievements so far. We want to make it easy for others to join the project and contribute to the Platform. We will provide access to the simple SDK and examples to get started. We also want to spread the ideas of the Thousand Brains Theory and the corresponding architecture to a wider audience. We aim to do this by writing blog posts, releasing videos, open-sourcing our code, publishing papers, and creating a community around the project.
For more details on our short-term goals see our project roadmap and current quarterly goals.
For comparison, the following results were obtained with the previous LM version (FeatureGraphLM). These results have not been updated since September 22nd, 2022. The results here were obtained with more densely sampled models than the results presented for the evidence LM, which means it is less likely for new points to be sampled. With the current, more sparsely sampled models used for the EvidenceLM, the FeatureGraphLM would show reduced performance.
Runtimes are reported on a laptop with 8 CPUs and no parallelization.
| Experiment | # objects | tested rotations | new sampling | other | Object Detection Accuracy | Rotation Error | Run Time |
|---|---|---|---|---|---|---|---|
| full_rotation_eval_all_objects | 77 YCB | 32 (xyz, 90) | no | | 73.62% | - | 4076min (68hrs) |
| full_rotation_eval | 4 YCB | 32 (xyz, 90) | no | | 98.44% | 0.04 rad | 5389s (89min) |
| partial_rotation_eval_base | 4 YCB | 3 (y, 90) | no | | 100% | 0 rad | 264s (4.4min) |
| sampling_learns3_infs5 | 4 YCB | 3 (y, 90) | yes | | 75% | 0.15 rad | 1096s (18.3min) |
| sampling_3_5_no_pose | 4 YCB | 3 (y, 90) | yes | don't try to determine pose | 100% | - | 1110s (18.5min) |
| sampling_3_5_no_pose_all_rot | 4 YCB | 32 (xyz, 90) | yes | don't try to determine pose | 96.55% | - | 1557s (25.9min) |
| sampling_3_5_no_curv_dir | 4 YCB | 3 (y, 90) | yes | not using curvature directions | 91.67% | 0.03 rad | 845s (14min) |
- Why is `full_rotation_eval_all_objects` so much worse than `full_rotation_eval`? The difference is that we test 77 objects instead of just 4. There are a lot of objects in the YCB dataset that are quite similar (i.e., a_cups, b_cups, ..., e_cups), and if we have all of them in memory there is more chance for confusion between them. Additionally, the 4 objects in `full_rotation_eval` are quite distinguishable and have fewer symmetries within themselves than some of the other objects do (like all the balls in the YCB dataset).
- Why is `full_rotation_eval_all_objects` so slow? In this experiment, we test all 77 YCB objects. This means that we also have to store models of all objects in memory and check sensory observations against all of them. At step 0 we have to test `#possible_objects x #possible_locations x #possible_rotations_per_location` hypotheses, which in this case is roughly 78 x 3,300 x 2 = 514,800. If we only test 4 objects, this is just 26,400. Additionally, we test all rotation combinations along all axes, which is 32 combinations.
- Why do the new sampling experiments work better if we don't determine the pose? First, the algorithm comes to a solution faster if the terminal condition does not require a unique pose. This makes it less likely to observe an inconsistent observation caused by the new sampling. Second, we don't have to rely as much on the curvature directions (used to inform possible poses otherwise), which can be quite noisy and change fast with new sampling.
- Why do the new sampling experiments take longer to run? The experiments reported here use step size 3 to learn the object models and step size 5 to recognize them. This ensures that we sample completely new points on the object. However, building a model with step size 3 leads to the models containing a lot more points. This scales the number of hypotheses that need to be tested (approx. 4 x 12,722 x 2 = 101,776).
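To make the hypothesis-count arithmetic above concrete, here is a toy calculation (plain Python; the function name is illustrative, not part of Monty):

```python
# Toy sketch of the initial hypothesis count discussed above.
def initial_hypotheses(n_objects, n_locations_per_object, n_rotations_per_location):
    return n_objects * n_locations_per_object * n_rotations_per_location

print(initial_hypotheses(78, 3300, 2))  # 514800 hypotheses with all YCB objects
print(initial_hypotheses(4, 3300, 2))   # 26400 hypotheses with only 4 objects
```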
To consolidate these concepts, please see the figure below for a potential instantiation of the system in a concrete setting. In this example, we see how the system could be applied to sensing and recognizing objects and scenes in a 3D environment using several different sensors, in this case touch and vision.
While this hopefully makes the key concepts described above more concrete and tangible, keep in mind that this is just one way in which the architecture can be instantiated. By design, the Platform can be applied to any application that involves sensing and active interaction with an environment. Indeed, this might include more abstract examples such as browsing the web, or interacting with the instruments that control a scientific experiment.
See our implementation documentation for details on how we implement this architecture in Monty.
Video: "2023/06 - The Cortical Messaging Protocol" https://www.youtube.com/watch?v=8IfIXQ2y2TM
We use a common messaging protocol that all components (LMs, SMs, and motor systems) adhere to. This makes it possible for all components to communicate with each other and to combine them arbitrarily. The CMP defines what information the outputs of SMs and LMs need to contain.
In short, a CMP-compliant output contains features at a pose. The pose contains a location in 3D space (naturally including 1D or 2D space) and represents where the sensed features are relative to the body, or another common reference point such as a landmark in the environment. The pose also includes information about the feature's 3D rotation. Additionally, the output can contain features that are independent of the object's pose such as color, texture, temperature (from the SM), or object ID (from the LM).
Besides features and their poses, the standard message packages also include information about the sender's ID (e.g., the particular sensor module) and a confidence rating.
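As a rough illustration, a CMP-compliant message could be sketched as a small data structure like the following. This is a conceptual sketch only, not Monty's actual message class:

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class CMPMessage:
    """Conceptual sketch of a CMP message: features at a pose."""

    sender_id: str         # e.g., the ID of the sending sensor module
    location: np.ndarray   # 3D location relative to a common reference point
    rotation: np.ndarray   # 3D rotation of the sensed feature
    pose_independent_features: dict = field(default_factory=dict)  # e.g., color, texture
    confidence: float = 1.0
```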
The inputs and outputs of the system (raw sensory input to the SM and motor command outputs from the policy) can have any format and do not adhere to any messaging protocol. They are specific to the agents' sensors and actuators and represent the system's interface with the environment.
The lateral votes between learning modules communicate unions of possible poses and objects. They do not contain any information about "features" from the perspective of that learning module's level of hierarchical processing. In other words, while an LM's object ID might be a feature at higher levels of processing, lateral votes do not send information about the features which that learning module itself has received. We further note that the vote output from one LM can also include multiple CMP message packages, representing multiple possible hypotheses.
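Building on the `CMPMessage` sketch above, a vote could then be pictured as a list of such messages, one per remaining hypothesis (again, purely illustrative):

```python
import numpy as np

# A union of possible poses for a hypothesized object, e.g., 'mug'.
hypothesized_poses = [
    (np.array([0.00, 0.10, 0.00]), np.eye(3)),  # hypothesis 1: (location, rotation)
    (np.array([0.00, 0.10, 0.05]), np.eye(3)),  # hypothesis 2
]
vote = [
    CMPMessage(
        sender_id="LM_0",
        location=loc,
        rotation=rot,
        pose_independent_features={"object_id": "mug"},
    )
    for loc, rot in hypothesized_poses
]
```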
At no point do we communicate structural model information between learning modules. What happens within a learning module does not get communicated to any other modules, and we never share the models stored in an LM's memory.
Communication between components (SMs, LMs, and motor systems) happens in a common reference frame (e.g., relative to the body). This makes it possible for all components to meaningfully interpret the pose information they receive. Internally, LMs then calculate displacements between consecutive poses and map them into the model's reference frame. This makes it possible to detect objects independently of their pose.
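For intuition, the displacement computation might look roughly like this (a toy sketch using scipy; the variable names are illustrative, not Monty's API):

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Two consecutive feature locations in the common (e.g., body-centric) frame.
prev_location = np.array([0.00, 0.10, 0.00])
curr_location = np.array([0.00, 0.10, 0.05])
displacement = curr_location - prev_location

# Map the displacement into the model's reference frame under a hypothesized
# object rotation, so it can be compared to locations stored in the model.
hypothesized_rotation = Rotation.from_euler("y", 90, degrees=True)
displacement_in_model_rf = hypothesized_rotation.inv().apply(displacement)
```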
The common reference frame also supports voting operations accounting for the relative displacement of sensors, and therefore LM models. For example, when two fingers touch a coffee mug in two different parts, one might sense the rim, while the other senses the handle. As such, "coffee mug" will be in both of their working hypotheses about the current object. When voting however, they do not simply communicate "coffee mug", but also where on the coffee mug other learning modules should be sensing it, according to their relative displacements. As a result, voting is not simply a "bag-of-features" operation, but is dependent on the relative arrangement of features in the world.
See our implementation documentation for details on how we implement the CMP in Monty.
The basic building block for sensorimotor processing and modeling the output from the sensor module is a learning module. These are repeating elements, each with the same input and output information format. Each learning module should function as a stand-alone unit and be able to recognize objects on its own. Combining multiple learning modules can speed up recognition (e.g. recognizing a cup using five fingers vs. one), allows for learning modules to focus on storing only some objects, and enables learning compositional objects.
Learning modules receive either feature IDs from a sensor or estimated object IDs (also interpreted as features) from a lower-level learning module [1]. The feature or object representation might be in the form of a discrete ID (e.g., the color red, a cylinder), or could be represented in a higher-dimensional space (e.g., an SDR representing hue or corresponding to a fork-like object). Additionally, learning modules receive the feature's or object's pose relative to the body, where the pose includes location and rotation. In this way, the pose relative to the body serves as a common reference frame for spatial computations, as opposed to the pose of features relative to each individual sensor. From this information, higher-level learning modules can build up graphs of compositional objects (e.g., large objects or scenes) and vote on the ID of the currently visible object(s).
The features and relative poses are incorporated into a model of the object. All models have an inductive bias towards learning the world as based in 3-dimensional space with an additional temporal dimension. However, the exact structure of space can potentially be learned, such that the lower-dimensional space of a melody, or the abstract space of a family tree, can be represented. When interacting with the physical world, the 3D inductive bias is used to place features in internal models accordingly.
The learning module therefore encompasses two major principles of the TBT: Sensorimotor learning, and building models using reference frames. Both ideas are motivated by studies of cortical columns in the neocortex (see figure below).
Besides learning new models, the learning module also tries to match the observed features and relative poses to already learned models stored in memory. In addition to performing such inference within a single LM, an LM's current hypotheses can be sent through lateral connections to other learning modules using the cortical messaging protocol. We note again that the CMP is independent of modality, and as such, LMs that have learned objects in different modalities (e.g., vision vs. touch), can still 'vote' with each other to quickly reach a consensus. This voting process is inspired by the voting process described in Hawkins, Ahmad, and Cui, 2017. Unlike when the CMP is used for the input and output of an LM, votes consist of multiple CMP-compliant messages, representing the union of multiple possible object hypotheses.
To generate the LM's output, we need to get the pose of the sensed object relative to the body. We can calculate this from the current incoming pose (pose of the sensed feature relative to the body) and the poses stored in the model of the object. This pose of the object can then be passed hierarchically to another learning module in the same format as the sensory input (features at a pose relative to the body).
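A toy version of this computation, under the assumption that the object pose is a rotation plus translation mapping model coordinates into body coordinates, might look like this (illustrative only):

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Sensed feature location relative to the body, and the matching stored
# location of that feature in the object's model reference frame.
feature_location_body = np.array([0.12, 0.05, 0.00])
feature_location_model = np.array([0.02, 0.05, 0.00])
object_rotation = Rotation.from_euler("y", 90, degrees=True)  # hypothesized

# If p_body = t + R @ p_model, then t = p_body - R @ p_model.
object_location_body = feature_location_body - object_rotation.apply(
    feature_location_model
)
```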
Once the learning module has determined the ID of an object and its pose, it can take the most recent observations (and possibly collect more) to update its model of this object. We can therefore continually learn more about the world and learning and inference are two intertwined processes.
See our implementation documentation for details on how we implement learning modules in Monty.
1: By object, we mean a discrete entity composed of a collection of one or more other objects, each with their own associated pose. As such, an object could also be a scene or any other composition of sub-objects. At the lowest level of object hierarchy, an object is composed of 'proto-objects' (commonly thought of as features), which are also discrete entities with a location and orientation in space, but which are output by the sensor modules; as such, these cannot be further decomposed into constituent objects. Wherever an object (or proto-object) is being processed at a higher level, it can also be referred to as a feature.
Below are additional details of the architecture, including how the three components outlined above interact. Further details of how these are implemented in Monty can be found in the following chapter.
Learning modules can be stacked in a hierarchical fashion to process larger input patches and higher-level concepts. A higher-level learning module receives feature and pose information from the output of a lower-level module and/or from a sensor patch with a larger receptive field, mirroring the connectivity of the cortex. The lower-level LM never sees the entire object it is modeling at once but infers it either through multiple consecutive movements and/or voting with other modules. The higher-level LM can then use the recognized model ID as a feature in its own models. This makes it more efficient to learn larger and more complex models as we do not need to represent all object details within one model. It also makes it easier to make use of object compositionality by quickly associating different object parts with each other as relative features in a higher-level model.
Additionally to learning on different spatial scales, modules can learn on different temporal scales. A low-level module may slowly learn to model general input statistics while a higher-level module may quickly build up temporary graphs of the current state of the world, as a form of short-term memory. Of course, low-level modules may also be able to learn quickly, depending on the application. This could be implemented by introducing a (fixed or learnable) speed parameter for each learning module.
Learning modules have lateral connections to each other to communicate their estimates of the current object ID and pose. For voting, we use a similar feature-pose communication as we use to communicate to higher-level modules. However, in this case we communicate a union of all possible objects and poses under the current evidence (multiple messages adhering to the CMP). Through the lateral voting connections between modules they try to reach a consensus on which object they are sensing at the moment and its pose (see figure below). This helps to recognize objects faster than a single module could.
The movement information (pose displacement) can be a copy of the selected action command (efference copy) or deduced from the sensory input. Without the efference copy, movement can for example be detected from optical flow or proprioception. Sensor modules use movement information to update their pose relative to the body. Learning modules use it to update their hypothesized location within an object's reference frame.
Each learning module produces a motor output. The motor output is formalized as a goal state and also adheres to the common messaging protocol. The goal state could for example be generated using the learned models and current hypotheses by calculating a sensor state which would resolve the most uncertainty between different possible object models. It can also help to guide directed and more efficient exploration to known features in a reference frame stored in memory. Different policies can be leveraged depending on whether we are trying to recognize an object or trying to learn new information about an object.
Hierarchy can also be leveraged for goal-states, where a more abstract goal-state in a high-level learning module can be achieved by decomposing it into simpler goal-states for lower-level learning modules. Importantly, the same learning modules that learn models of objects are used to generate goal-states, enabling hierarchical, model-based policies, no matter how novel the task.
See our implementation documentation for details on how we implement policies in Monty.
The architecture is an entire sensorimotor system. Each learning module receives sensory input and an efference copy of the motor command and outputs a feature-at-pose along with a motor command. Since many modules may produce conflicting motor commands (e.g., different patches on the retina cannot move in opposite directions) they usually need to be coordinated in a motor area. This motor area contains an action policy that decides which action commands to execute in the world based on the motor outputs from all learning modules. It also needs to translate the goal state outputs of the learning modules into motor commands for the actuators. It then sends this motor command to the actuators of the body and an efference copy of it back to the sensor modules.
In the brain, a lot of this processing occurs subcortically. Therefore in our system, we also don't need to resolve these issues within a learning module but can do it within a separate motor area. However, we need to keep in mind that the motor area does not know about the models of objects that are learned in the learning modules and therefore needs to receive useful model-based motor commands from the LMs.
Learned models in the memory of the learning module can be used to make predictions about future observations. If there are multiple models that match the current observations, the predictions would have more uncertainty attached to them. The prediction error can be used as a learning signal to update models or as a criterion for matching during object recognition.
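As a loose illustration of using prediction error as a matching criterion (toy values and names, not Monty's API):

```python
import numpy as np

# A stored model predicts the feature expected at the current location; the
# mismatch with the actual observation can drive matching and learning.
predicted_feature = np.array([0.80, 0.10, 0.10])  # e.g., predicted color (RGB)
observed_feature = np.array([0.75, 0.12, 0.10])
prediction_error = np.linalg.norm(observed_feature - predicted_feature)

error_tolerance = 0.1
is_match = prediction_error < error_tolerance  # True for these toy values
```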
Currently there is no prediction in time, although in the future such capabilities will be added via the inclusion of a temporal dimension. This will help support encoding behaviors of objects, as well as predictions that can be used for motor-policy planning. For example, the long-term aim is for the architecture to be able to predict how a simple object such as a stapler evolves as it is opened or closed, or to coarsely model the physical properties of common materials.
Each sensor module receives information from a small sensory patch as input. This is analogous to a small patch on the retina (as in the figure below), or a patch of skin, or the pressure information at one whisker of a mouse. One sensor module sends information to one learning module which models this information. Knowledge of the whole scene is integrated through the communication between multiple learning modules that each receive different patches of the input space or through time by one learning module moving over the scene.
The sensor module contains a sensor and associates the input to the sensor with a feature representation. The information processing within the sensor module can turn the raw information from the sensor patch into a common representation (e.g. a Sparse Distributed Representation or SDR) which could be compared to light hitting the retina and being converted into a spike train, the pulses of electricity emitted by biological neurons. Additionally, the feature pose relative to the body is calculated from the feature's pose relative to the sensor and the sensor pose relative to the body. This means each sensor module outputs its current pose (location and rotation) as well as the features it senses at that pose. We emphasize that the available pose information is central to how the architecture operates.
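The pose composition described here might look roughly like the following (a minimal sketch with scipy; the variable names are illustrative):

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Pose of the sensor relative to the body.
sensor_location_body = np.array([0.10, 0.00, 0.00])
sensor_rotation_body = Rotation.from_euler("y", 45, degrees=True)

# Pose of the sensed feature relative to the sensor.
feature_location_sensor = np.array([0.00, 0.00, -0.02])
feature_rotation_sensor = Rotation.from_euler("x", 10, degrees=True)

# Compose the two poses to express the feature pose relative to the body.
feature_location_body = sensor_location_body + sensor_rotation_body.apply(
    feature_location_sensor
)
feature_rotation_body = sensor_rotation_body * feature_rotation_sensor
```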
A general principle of the system is that any processing specific to a modality happens in the sensor module. The output of the sensor module is not modality-specific anymore and can be processed by any learning module. A crucial requirement here is that each sensor module knows the pose of the feature relative to the sensor. This means that sensors need to be able to detect features and poses of features. The system can work with any type of sensor (vision, touch, radar, LiDAR,...) and integrate information from multiple sensory modalities without effort. For this to work, sensors need to communicate sensory information in a common language.
See our implementation documentation for details on how we implement sensor modules in Monty.
The basic repository structure looks as follows:
```
.
|-- docs/                 # .md files for documentation
|-- rfcs/                 # Merged RFCs
|-- benchmarks/           # experiments testing Monty
|   |-- configs/
|   |-- run_parallel.py
|   `-- run.py
|-- src/tbp/monty/
|   |-- frameworks/
|   |   |-- actions
|   |   |-- config_utils
|   |   |-- environment_utils
|   |   |-- environments  # Environments Monty can learn in
|   |   |-- experiments   # Experiment classes
|   |   |-- loggers
|   |   |-- models        # LMs, SMs, motor system, & CMP
|   |   `-- utils
|   `-- simulators/
|       `-- habitat/
|           |-- actions.py
|           |-- actuator.py
|           |-- agents.py
|           |-- configs.py
|           |-- environment.py
|           |-- sensors.py
|           `-- simulator.py
|-- tests/                # Unit tests
|-- tools/
`-- README.md
```
This is a slightly handpicked selection of folders and subfolders that tries to highlight the most important folders to get started.
The frameworks, simulators, and tests folders contain many files that are not listed here. The main code used for modeling can be found in src/tbp/monty/frameworks/models/.
Below we highlight a few issues that often crop up and can present problems when they are not otherwise apparent:
Be aware that in NumPy, and in the saved CSV result files, quaternions follow the wxyz format, where "w" is the real component. Thus the identity rotation would be [1, 0, 0, 0]. In contrast, scipy's `Rotation` expects quaternions in xyzw format. When operating with quaternions, it is therefore important to be aware of which format applies in your particular setting.
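For example, converting a quaternion from the wxyz convention to the xyzw convention that scipy expects is a one-element roll (a minimal sketch):

```python
import numpy as np
from scipy.spatial.transform import Rotation

q_wxyz = np.array([1.0, 0.0, 0.0, 0.0])  # identity rotation in wxyz format
q_xyzw = np.roll(q_wxyz, -1)             # move the real component "w" to the back
rotation = Rotation.from_quat(q_xyzw)    # scipy expects xyzw
print(rotation.as_euler("xyz", degrees=True))  # [0. 0. 0.]
```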
Note that in Habitat (and therefore the Monty code-base), the "z" direction is positive in the direction coming out of the screen, while "y" is the "up" direction. "x" is positive pointing to the right, again if you are facing the screen.
Note that the rotation that learning modules store in their Most-Likely Hypothesis (MLH) is the rotation required to transform a feature (such as a surface normal) to match the feature on the object in the environment. As such, it is the inverse of the actual orientation of the object in the environment.
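As a hypothetical example of this relationship:

```python
from scipy.spatial.transform import Rotation

# If an object in the environment is rotated 90 degrees about the "up" (y)
# axis, the MLH stores the inverse of that rotation, i.e., the rotation that
# maps stored features onto the corresponding observed features.
object_rotation_env = Rotation.from_euler("y", 90, degrees=True)
mlh_rotation = object_rotation_env.inv()
print(mlh_rotation.as_euler("xyz", degrees=True))  # [  0. -90.   0.]
```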
For more details, see our documentation on reference frame transformations in Monty.
Note that sensor-based actions (such as `set_sensor_pose`) update _all_ the sensors associated with that agent. For example, if the view-finder and the main patch sensor associated with an agent were at different locations relative to one another, but `set_sensor_pose` sets a new absolute location (e.g., 0, 0, 0), both sensors will now have this location, and they will lose the relative offset they had before. While the exact effect depends on whether the action is in relative or absolute coordinates, the modification of all sensors associated with an agent is an inherent property of Habitat.
For more info on contributing custom modules, see Ways to Contribute to Code
Monty is designed as a modular framework where a user can easily swap out different implementations of the basic Monty components. For instance, you should be able to switch out the type of learning module used without changing the sensor modules, environment, or motor system. The basic components of Monty are defined as abstract classes in abstract_monty_classes.py. To implement your own custom version of one of these classes, simply subclass the abstract class or one of its existing subclasses.
Here are step by step instructions of how to customize using the LearningModule class as an example.
Learning modules (LMs) can be loosely compared to cortical columns, and each learning module receives a subset of the input features. At each time step, each learning module performs a modeling step and a voting step. In the modeling step, LMs update in response to sensory inputs. In the voting step, each LM updates in response to the state of other LMs. Note that the operations each LM performs during the `step` and `receive_votes` steps are customizable.
There is a 3-step process for creating a custom learning module. All other classes can be customized along the same lines.
1. Define a new subclass of LearningModule (LM), either in a local projects folder or in `/src/tbp/monty/frameworks/models`. See the abstract class definitions in `/src/tbp/monty/frameworks/models/abstract_monty_classes.py` for the functions that every LM should implement.
2. Define a config for your learning module. You are encouraged, but not required, to define a dataclass with all the arguments to your LM, like in `/src/tbp/monty/frameworks/config_utils/config_args.py`, with subclasses of that dataclass defining default arguments for different common instantiations of your LM. You are also encouraged to define a dataclass config that extends the `MontyConfig` dataclass (also in config_args.py), where `learning_module_configs` is a dictionary defining a `learning_module_class` and its `learning_module_args` for each learning module.
3. Define an experiment config in `benchmarks/configs/` (or in your own repository, or in your `monty_lab` projects folder) and set `monty_config` to a config that fully specifies your model architecture, for example the dataclass extending `MontyConfig` defined above.
Your custom config could look like this:
```python
import copy

# Deep-copy the base config and swap in your custom learning module.
my_custom_config = copy.deepcopy(base_config_10distinctobj_dist_agent)
my_custom_config.update(
    monty_config=PatchAndViewSOTAMontyConfig(
        learning_module_configs=dict(
            learning_module_0=dict(
                learning_module_class=YourCustomLM,
                learning_module_args=dict(
                    your_arg1=val1,
                    your_arg2=val2,
                ),
            ),
        ),
    ),
)
```
For simplicity, we inherit all other default values from the `base_config_10distinctobj_dist_agent` config in `benchmarks/configs/ycb_experiments.py` and use the `monty_config` specified in the `PatchAndViewSOTAMontyConfig` dataclass.
Note
You can find our code at https://github.com/thousandbrainsproject/tbp.monty
This is our open-source repository. We call it Monty after Vernon Mountcastle, who proposed cortical columns as a repeating functional unit across the neocortex.
Warning
This guide will not work on Windows or non-x86_64/amd64 Linux.
A guide for Windows Subsystem for Linux is available here: Getting Started on Windows via WSL
You can follow these GitHub issues for updates:
If you run into problems, please let us know by opening a GitHub issue or posting in the Monty Code section of our forum.
Warning
While the repository contains a uv.lock file, this is currently experimental and not supported. In the future this will change, but for now, avoid trying to use uv with this project.
It is best practice (and required if you ever want to contribute code) first to make a fork of our repository and then make any changes on your local fork. To do this you can simply visit our repository and click on the fork button as shown in the picture below. For more detailed instructions see the GitHub documentation on Forks.
Next, you need to clone the repository onto your device. To do that, open the terminal, navigate to the folder where you want the code to be downloaded and run git clone repository_path.git. You can find the repository_path.git on GitHub by clicking on the <>Code button as shown below. For more details see the GitHub documentation on cloning.
Note
If you want the same setup as we use at the Thousand Brains Project by default, clone the repository at ${HOME}/tbp/. If you don't have a tbp folder in your home directory yet you can run cd ~; mkdir tbp; cd tbp to create it. It's not required to clone the code in this folder but it is the path we assume in our tutorials.
If you just forked and cloned this repository, you may skip this step, but any other time you get back to this code, you will want to synchronize it to work with the latest changes.
To make sure your fork is up to date with our repository you need to click on Sync fork -> Update branch in the GitHub interface. Afterwards, you will need to get the newest version of the code into your local copy by running git pull inside this folder.
You can also update your code using the terminal by calling git fetch upstream; git merge upstream/main. If you have not linked the upstream repository yet, you may first need to call git remote add upstream upstream_repository_path.git
Monty requires Conda to install its dependencies. For instructions on how to install Conda (Miniconda or Anaconda) on your machine see https://conda.io/projects/conda/en/latest/user-guide/install/index.html.
To setup Monty, use the conda commands below. Make sure to cd into the tbp.monty directory before running these commands.
Note that the commands are slightly different depending on whether you are setting up the environment on an Intel or ARM64 architecture, and whether you are using the zsh shell or another shell.
Note
On Apple Silicon we rely on Rosetta to run Intel binaries on ARM64 and include the softwareupdate --install-rosetta command in the commands below.
You can create the environment with the following commands:
For Intel machines using the zsh shell:

```
conda env create
conda init zsh
conda activate tbp.monty
```

For Intel machines using another shell:

```
conda env create
conda init
conda activate tbp.monty
```

For ARM64 (Apple Silicon) machines using the zsh shell:

```
softwareupdate --install-rosetta
conda env create -f environment_arm64.yml --subdir=osx-64
conda init zsh
conda activate tbp.monty
conda config --env --set subdir osx-64
```

For ARM64 (Apple Silicon) machines using another shell:

```
softwareupdate --install-rosetta
conda env create -f environment_arm64.yml --subdir=osx-64
conda init
conda activate tbp.monty
conda config --env --set subdir osx-64
```

Note

By default, Conda will activate the base environment when you open a new terminal. If you do not want Conda to change your global shell when you open a new terminal, run:

```
conda config --set auto_activate_base false
```

The best way to see whether everything works is to run the unit tests using this command:

```
pytest
```

Running the tests might take a little while (depending on what hardware you are running on), but in the end, you should see something like this:
Note that by the time you read this, there may be more or fewer unit tests in the code base, so the exact numbers you see will not match this screenshot. The important thing is that all tests either pass or are skipped, and none of the tests fail.
In your usual interaction with this code base, you will most likely run experiments, not just unit tests. You can find experiment configs in the benchmarks/configs/ folder.
A lot of our current experiments are based on the YCB dataset, a dataset of 77 3D objects that we render in Habitat. To download the dataset, run:

```
python -m habitat_sim.utils.datasets_download --uids ycb --data-path ~/tbp/data/habitat
```
| Models | Archive Format | Download Link |
|---|---|---|
| pretrained_ycb_v10 | tgz | pretrained_ycb_v10.tgz |
| pretrained_ycb_v10 | zip | pretrained_ycb_v10.zip |
Unpack the archive in the ~/tbp/results/monty/pretrained_models/ folder. For example:

```
mkdir -p ~/tbp/results/monty/pretrained_models/
cd ~/tbp/results/monty/pretrained_models/
curl -L https://tbp-pretrained-models-public-c9c24aef2e49b897.s3.us-east-2.amazonaws.com/tbp.monty/pretrained_ycb_v10.tgz | tar -xzf -
```

or, for the zip archive:

```
mkdir -p ~/tbp/results/monty/pretrained_models/
cd ~/tbp/results/monty/pretrained_models/
curl -O https://tbp-pretrained-models-public-c9c24aef2e49b897.s3.us-east-2.amazonaws.com/tbp.monty/pretrained_ycb_v10.zip
unzip pretrained_ycb_v10.zip
```
The folder should then have the following structure:
```
~/tbp/results/monty/pretrained_models/
|-- pretrained_ycb_v10/
|   |-- supervised_pre_training_5lms
|   |-- supervised_pre_training_5lms_all_objects
|   |-- ...
```
Note
To unpack an archive you should be able to double click on it.
To unpack via the command line, copy the archive into the ~/tbp/results/monty/pretrained_models/ folder and inside that folder run:
- for a `tgz` archive, run `tar -xzf pretrained_ycb_v10.tgz`
- for a `zip` archive, run `unzip pretrained_ycb_v10.zip`
If you did not save the pre-trained models in the ~/tbp/results/monty/pretrained_models/ folder, you will need to set the MONTY_MODELS environment variable.
```
export MONTY_MODELS=/path/to/your/pretrained/models/dir
```

This path should point to the pretrained_models folder that contains the pretrained_ycb_v10 folder.
If you did not save the data (e.g., YCB objects) in the ~/tbp/data folder, you will need to set the MONTY_DATA environment variable.
```
export MONTY_DATA=/path/to/your/data
```

This path should point to the data folder, which contains data used for your experiments. Examples of data stored in this folder are the `habitat` folder containing YCB objects, the `worldimages` folder containing camera images for the 'Monty Meets World' experiments, and the `omniglot` folder containing the Omniglot dataset.
If you would like to log your experiment results in a different folder than the default path (~/tbp/results/monty/) you need to set the MONTY_LOGS environment variable.
```
export MONTY_LOGS=/path/to/log/folder
```

We recommend not saving the wandb logs in the repository itself (the default save location). If you have already set the MONTY_LOGS variable, you can set the wandb directory like this:

```
export WANDB_DIR=${MONTY_LOGS}/wandb
```

Now you can finally run an experiment! To do this, simply use this command:
```
python benchmarks/run.py -e my_experiment
```

Replace `my_experiment` with the name of one of the experiment configs in `benchmarks/configs/`. For example, a good one to start with could be `randrot_noise_10distinctobj_surf_agent`.
If you want to run an experiment with parallel processing to make use of multiple CPUs, simply use the run_parallel.py script instead of the run.py script like this:
```
python benchmarks/run_parallel.py -e my_experiment
```

A good next step to get more familiar with our approach and the Monty code base is to go through our tutorials. They include follow-along code and detailed explanations of how Monty experiments are structured, how Monty can be configured in different ways, and what happens when you run a Monty experiment.
If you would like to contribute to the project, you can have a look at the many potential ways to contribute, particularly ways to contribute code.
You can also have a look at the capabilities of Monty and our project roadmap to get an idea of what Monty is currently capable of and what features our team is actively working on.
If you run into any issues or questions, please head over to our Discourse forum or open an Issue. We are always happy to help!
Actions are how Monty moves Agents in its Environment. Monty's Motor System outputs Actions, which are then actuated within the Environment. An Agent is what Monty calls its end effectors, which carry out actions in the environment, physically or in simulation (like a robotic arm or a camera), or virtually (like a web browser navigating the Internet).
Actions are coupled to what can be actuated in the Environment. For example, if a simple robot can only go forward and backward, then the meaningful actions for that robot are restricted to going forward and backward. The robot environment would be unable to actuate a Jump action.
With that in mind, before creating a new Action, consider what can be actuated in the Environment.
At a high level, Actions are created by the Motor System and actuated by the Environment.
Within the Motor System, Policies either explicitly choose specific Actions by creating them directly (e.g., create MoveForward), or sample random Actions from a pool of actions available (e.g., sample MoveForward from the action space {MoveForward, MoveBack}).
Within the Environment, the Actions are actuated either within a simulator, by a robot, or directly.
Additionally, within an experimental framework, a Positioning Procedure can generate Actions before starting the experiment. This is analogous to an experimenter moving Monty into a starting position. The Positioning Procedure can use privileged information such as labels, ground truth object models, or task instructions, and is independent of Monty and the models learned in its learning modules.
Once you have a new Action in mind, to create it, you'll need to explicitly implement the Action protocol:
```python
class Jump(Action):
    def __init__(self, agent_id: AgentID, how_high: float) -> None:
        super().__init__(agent_id=agent_id)
        self.how_high = how_high
```

If at any point this action will be read from a file, then be sure to update the ActionJSONDecoder as well:
```python
class ActionJSONDecoder(JSONDecoder):
    # ...
    def object_hook(self, obj: dict[str, Any]) -> Any:
        # ...
        elif action == Jump.action_name():
            return Jump(
                agent_id=obj["agent_id"],
                how_high=obj["how_high"],
            )
        # ...
```

With the above in place, the Motor System can now create the Action as needed.
Action samplers can be used to randomly sample actions from a predefined action space. You can use different samplers to implement different sampling strategies. For example, one sampler could always return the same action. Another sampler could return a predetermined sequence of actions. Yet another could randomly pick an action. Additionally, samplers can parameterize the actions differently: one sampler could configure all movement actions with the same distance to move, while another could randomly sample the specific distance to move. For examples of samplers currently used by Monty, see src/tbp/monty/frameworks/actions/action_samplers.py.
For the Motor System to be able to sample the Action with the help of a sampler, you will need to include a sample method and declare an action-specific sampler protocol:
```python
class JumpActionSampler(Protocol):
    def sample_jump(self, agent_id: AgentID) -> Jump: ...


class Jump(Action):
    @staticmethod
    def sample(agent_id: AgentID, sampler: JumpActionSampler) -> Jump:
        return sampler.sample_jump(agent_id)

    def __init__(self, agent_id: AgentID, how_high: float) -> None:
        super().__init__(agent_id=agent_id)
        self.how_high = how_high
```

This prepares the Action to be used in a sampler.
The sampler itself will subclass ActionSampler and implement the requisite protocol:
```python
class MyConstantSampler(ActionSampler):
    def __init__(
        self,
        actions: list[type[Action]] | None = None,
        height: float = 1.8,
        rng: Generator | None = None,
    ) -> None:
        super().__init__(actions=actions, rng=rng)
        self.height = height  # store the configured jump height

    def sample_jump(self, agent_id: AgentID) -> Jump:
        return Jump(agent_id=agent_id, how_high=self.height)
```

The sampler will then be used along the lines of:

```python
sampler = MyConstantSampler(actions=[Jump])
action = sampler.sample("agent_0")
```

For an Action to take effect within an Environment, it needs to be actuated. Include the act method and declare an action-specific actuator protocol:
```python
class JumpActuator(Protocol):
    def actuate_jump(self, action: Jump) -> None: ...


class Jump(Action):
    def __init__(self, agent_id: AgentID, how_high: float) -> None:
        super().__init__(agent_id=agent_id)
        self.how_high = how_high

    def act(self, actuator: JumpActuator) -> None:
        actuator.actuate_jump(self)
```

The actuator itself will be specific to the Environment and the simulator, robot, or otherwise. However, in general, it will implement the actuator protocol:
```python
class MyActuator:
    def actuate_jump(self, action: Jump) -> None:
        # custom code to make Jump happen in your system
        pass
```

The Environment, simulator, or robot will invoke the actuator by calling the Action's act method:
```python
SomeSimulatorAction = Union[Jump, Tumble]  # Actions the simulator can actuate


class SomeSimulator:
    def step(self, actions: Sequence[SomeSimulatorAction]) -> Observations:
        # ...
        for action in actions:
            action.act(self.actuator)
        # ...
```

You could implement the actuator as a mixin, in which case it would look more like:
```python
class SomeSimulator(MyActuator):
    def step(self, actions: Sequence[SomeSimulatorAction]) -> Observations:
        # ...
        for action in actions:
            action.act(self)
        # ...
```

For an example of an actuator, see src/tbp/monty/simulators/habitat/simulator.py.
To manage the logging for an experiment, you can specify the handlers that should be used in the `logging_config`. The logging config has two fields for handlers: one for `monty_handlers` and one for `wandb_handlers`. The latter will start a wandb session unless it is an empty list. The former can contain all other, non-wandb handlers.
| Class Name | Description |
|---|---|
| MontyHandler | Abstract handler class. |
| DetailedJSONHandler | Logs detailed information about every step in a .json file. This is for detailed analysis and visualization. For longer experiments, it is not recommended. |
| BasicCSVStatsHandler | Log a .csv file with one row per episode that contains the results and performance of this episode. |
| ReproduceEpisodeHandler | Logs action sequence and target such that an episode can be exactly reproduced. |
| BasicWandbTableStatsHandler | Logs a table similar to the .csv table to wandb. |
| BasicWandbChartStatsHandler | Logs episode stats to wandb charts. When running in parallel this is done at the end of a run. Otherwise one can follow the run stats live on wandb. |
| DetailedWandbHandler | Logs animations of raw observations to wandb. |
| DetailedWandbMarkedObsHandler | Same as previous but marks the view-finder observation with a square indicating where the patch is. |
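As a rough sketch, a logging config using these handlers might look like the following, assuming `LoggingConfig` and the handler classes are imported from Monty's config and logger modules (check the actual import paths in the code base):

```python
# A minimal sketch, not a verbatim Monty config.
logging_config = LoggingConfig(
    monty_handlers=[BasicCSVStatsHandler],  # one row of results per episode
    wandb_handlers=[],  # an empty list means no wandb session is started
)
```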
When logging to wandb, we recommend running `export WANDB_DIR=~/tbp/results/monty/wandb` so the wandb logs are not stored in the repository folder.
The first time you run experiments that log to wandb, you will need to set your WANDB_API_KEY using `export WANDB_API_KEY=your_key`.
The plot_utils.py file contains utils for plotting the logged data. The logging_utils.py file contains some useful utils for loading logs and printing summary statistics on them.
Note
Install analysis optional dependencies to use plot_utils.py
pip install -e .'[analysis]'
There are many ways of visualizing the logged data. Below are some commonly used functions as examples, but to get a full picture it is best to have a look at the functions in logging_utils.py and plot_utils.py.
The easiest way to load logged data is using the load_stats function. This function is useful if you use the DetailedJSONHandler or the BasicCSVStatsHandler or simply want to load learned models. You can use the function parameters to load only some of the stats selectively. For example, set load_detailed to False if you didn't collect detailed stats in a JSON file or set load_train to False if you only ran validation.
👍 You can Follow Along with this Code
If you ran the `randrot_10distinctobj_surf_agent` benchmark experiment as described in the Running Benchmarks guide, you should be able to run the code below.
```python
import os

from tbp.monty.frameworks.utils.logging_utils import load_stats

pretrain_path = os.path.expanduser("~/tbp/results/monty/pretrained_models/")
pretrained_dict = pretrain_path + "pretrained_ycb_v10/surf_agent_1lm_10distinctobj/pretrained/"

log_path = os.path.expanduser("~/tbp/results/monty/projects/evidence_eval_runs/logs/")
exp_name = "randrot_10distinctobj_surf_agent"
exp_path = log_path + exp_name

train_stats, eval_stats, detailed_stats, lm_models = load_stats(
    exp_path,
    load_train=False,     # doesn't load train csv
    load_eval=True,       # loads eval_stats.csv
    load_detailed=False,  # doesn't load the detailed .json stats
    load_models=True,     # loads the .pt models
    pretrained_dict=pretrained_dict,
)
```

Alternatively, you can of course always just load the stats files (logged at the `output_dir` specified in the experiment config) using `pd.read_csv`, `json.load`, `torch.load`, or any other loading function you prefer.
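For example, loading just the eval CSV with pandas might look like this (a minimal sketch; the file name follows the eval_stats.csv convention mentioned above):

```python
import os

import pandas as pd

exp_path = os.path.expanduser(
    "~/tbp/results/monty/projects/evidence_eval_runs/logs/randrot_10distinctobj_surf_agent"
)
eval_stats = pd.read_csv(os.path.join(exp_path, "eval_stats.csv"))
print(eval_stats.head())
```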
📘 JSON Logs can get Large Fast
The detailed JSON logs save very detailed information for every step and episode. This includes, for example, the pixel observations at every step. The .json files can therefore get large very quickly. You should use the `DetailedJSONHandler` with care and only if needed. Remember to adjust the number of epochs and episodes to be as small as possible.

If you need to load a larger JSON file, it makes sense to load it with the `deserialize_json_chunks` function. This way you can load one episode at a time, process its data, clear memory, and then load the next episode.
We have a couple of useful utils to calculate and print summary statistics from the logged .csv files in logging_utils.py. Below are a few examples of how to use them after loading the data as shown above.
```python
from tbp.monty.frameworks.utils.logging_utils import print_overall_stats

print_overall_stats(eval_stats)
```

```python
from tbp.monty.frameworks.utils.logging_utils import print_unsupervised_stats

print_unsupervised_stats(train_stats, epoch_len=10)
```

👍 If you are Following Along you Should see Something like This:

Detected 100.0% correctly
overall run time: 104.03 seconds (1.73 minutes), 1.04 seconds per episode, 0.04 seconds per step.
When loading the lm_models (either using the load_stats function or torch.load), you get a dictionary of object graphs. Object graphs are represented as torch_geometric.data class instances with properties `x`, `pos`, `norm`, and `feature_mapping`, where `x` stores the features at each point in the graph, `pos` the locations, `norm` the surface normals, and `feature_mapping` is a dictionary that encodes which indices in `x` correspond to which features.
There is a range of graph plotting utilities in the plot_utils.py file. Additionally, you can find more examples of how to plot graphs and how to use these functions in GraphVisualizations.ipynb, EvidenceLM.ipynb, and MultiLMViz.ipynb in the monty_lab repository. Below is one illustrative example of how you can quickly plot an object graph after loading it as shown above.
import matplotlib.pyplot as plt
from tbp.monty.frameworks.utils.plot_utils_dev import plot_graph
# Visualize the object called 'mug' from the pretrained graphs loaded above from pretrained_dict
plot_graph(lm_models['pretrained'][0]['mug']['patch'], rotation=120)
plt.show()
from tbp.monty.frameworks.utils.plot_utils_dev import plot_graph
# Visualize how the graph for the first object learned (new_object0) looks in epoch 3 in LM_0
plot_graph(lm_models['3']['LM_0']['new_object0']['patch'])
plt.show()
📘 Plotting in 3D
Most plots shown here use the 3D projection feature of matplotlib. The plots can be viewed interactively by dragging the mouse over them to zoom and rotate. When you want to save figures with 3D plots programmatically, it can be useful to set the rotation parameter of the plot_graph function such that the point of view provides a good view of the object's 3D structure.
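For example, a minimal sketch of saving such a figure programmatically, assuming plot_graph draws into the active matplotlib figure (the file name is illustrative):
# Draw the graph from a fixed viewpoint and save it instead of showing it.
plot_graph(lm_models["pretrained"][0]["mug"]["patch"], rotation=120)
plt.savefig("mug_graph_rot120.png", dpi=200)
plt.close()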
Since Monty is a sensorimotor framework, everything happens as a time series of sensing and acting. Therefore, many aspects of an experiment are better visualized as an animation that can show how hypotheses evolve over time. We have a couple of functions for creating animations in the plot_utils.py file. Below is one example of how they can be applied to data loaded with the code above.
📘 To Follow Along Here You Need to Use the Detailed Logger
Detailed JSON stats are not logged by default since they can get large quickly. To be able to run the following analysis, you need to update the experiment config with this line:
logging_config=DetailedEvidenceLMLoggingConfig(),
Remember that you will also need to import DetailedEvidenceLMLoggingConfig at the top of the file. To keep file sizes small, we also recommend not logging too many episodes with the detailed logger; update the number of objects tested and the number of epochs like this:
# Changes to make to the randrot_10distinctobj_surf_agent config to follow along:
experiment_args=EvalExperimentArgs(
    model_name_or_path=model_path_10distinctobj,
    n_eval_epochs=1,  # <--- Setting n_eval_epochs to 1
    max_total_steps=5000,
),
logging_config=DetailedEvidenceLMLoggingConfig(),  # <--- Setting the detailed logger
eval_env_interface_args=EnvironmentInterfacePerObjectArgs(
    object_names=get_object_names_by_idx(
        0, 1, object_list=DISTINCT_OBJECTS  # <--- Only testing one object
    ),
    object_init_sampler=RandomRotationObjectInitializer(),
),
🚧 TODO: Add code for some of the animate functions
Most of them are for the old LMs, so they probably won't make sense to show here. Maybe we should even remove them from the code.
There are some animation functions for policy visualizations. @Niels do you think it makes sense to demo them here?
Data generated from an experiment using the EvidenceLM (currently the default setup) is best plotted using a loop, as shown below.
from tbp.monty.frameworks.utils.plot_utils_dev import (show_initial_hypotheses,
plot_evidence_at_step)
episode = 0
lm = 'LM_0'
objects = ['mug','bowl','dice','banana'] # Up to 4 objects to visualize evidence for
current_evidence_update_threshold = -1
save_fig = True
save_path = exp_path + '/stepwise_examples/'
# [optional] Show initial hypotheses for each point on the object
show_initial_hypotheses(detailed_stats, episode, 'mug', rotation=[120,-90], axis=2,
save_fig=save_fig, save_path=save_path)
# Plot the evidence for each hypothesis on each of the objects & show the observations used for updating
for step in range(eval_stats['monty_matching_steps'][episode]):
plot_evidence_at_step(detailed_stats,
lm_models,
episode,
step,
objects,
is_surface_sensor=True, # set this to False if not using the surface agent
save_fig=save_fig,
                          save_path=save_path)
The above code should create an image like the one shown below for each step in the experiment and save it in a folder called stepwise_examples inside the logs folder of this experiment.
Since the episode statistics are saved in a .csv table, you can also create all the standard plots of this data (such as bar plots of the number of correct episodes, the number of steps per episode, and so on). For example, you could create the plot below using this code:
import numpy as np
import seaborn as sns # For this you will have to install seaborn
import matplotlib.pyplot as plt
rot_errs = np.array(eval_stats[eval_stats["primary_performance"]=="correct"]["rotation_error"])
rot_errs = rot_errs * 180 / np.pi
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
sns.histplot(rot_errs)
plt.xlabel("Rotation Error (degrees)")
plt.subplot(1,2,2)
sns.histplot(eval_stats, x="num_steps")
plt.xlabel("# of Steps")
plt.show()
Of course, there are many more options for visualizing the logged data. The code and images above are just some examples to provide inspiration.
When logging to Wandb, you can track your experiment results at https://wandb.ai/home. When you navigate to your project (called "Monty" by default), you should see all your experiments in the Wandb dashboard. If you click on one of the runs, you should see something like this:
If you are using the BasicWandbTableStatsHandler, you will also see a table like this:
You can then create all kinds of plots from this data. A convenient way of doing this is using wandb reports but you can also create custom plots within the dashboard or download the data and plot it with any software of your choice.
title: Running Benchmarks
description: Benchmark experiments should be run with every functional change to the code. This is how you do it.
Whenever you change something in the code's functionality, you need to run the benchmark experiments to ensure the change doesn't degrade performance. Also, if you implement an alternative approach, the benchmark experiments are a good way to compare it to our current best implementation.
The benchmark test suite is designed to evaluate the performance of Monty in different scenarios. These include rotated and translated objects, noisy observations, similar objects, different action spaces, multiple LMs, multiple objects, real-world data, and continual unsupervised learning.
For more details on the current benchmark experiments, see this page.
When merging a change that impacts the performance on the benchmark experiments, you need to update the table in our documentation here.
To run a benchmark experiment, simply call
python benchmarks/run.py -e run_name
and replace run_name with the name of the benchmark experiment. All benchmark experiment configs are in the benchmarks/configs/ folder. For example, to run the quickest benchmark experiment you would call
python benchmarks/run.py -e randrot_10distinctobj_surf_agent
👍 Go Ahead and Run the Command Above!
If you run the randrot_10distinctobj_surf_agent experiment using the command above, you will be able to follow along with all of the following data analysis guides, since we use it as an example. This should take about 1.5 minutes on an M3 MacBook or 5 minutes distributed across 16 CPU cores.
🚧 TODO: Figure out a way to have timings comparable
Currently we run all benchmark experiments on our cloud infrastructure, using 16 CPUs. However, external people will not have access to this and their hardware may give very different runtimes. How do we deal with this?
If you are using a wandb logger (used by default in the benchmark experiment configs), you can view the experiment results in the wandb dashboard. If you go into the "Runs" tab (selected on the left), you should see the summary statistics in the columns starting with "overall/".
If you are not using wandb, you can also calculate the statistics from the saved .csv file.
import os
from tbp.monty.frameworks.utils.logging_utils import (load_stats,
                                                      print_overall_stats,
                                                      print_unsupervised_stats)
# Adjust log_path to the output_dir used in your experiment config.
log_path = os.path.expanduser("~/tbp/results/monty/projects/evidence_eval_runs/logs/")
_, eval_stats, _, _ = load_stats(log_path + 'run_name',
load_train=False,
load_eval=True,
load_models=False,
load_detailed=False)
print_overall_stats(eval_stats)
# for the learning from scratch experiments, load the training csv instead and call
print_unsupervised_stats(train_stats, epoch_len=10)  # 10 = number of objects shown per epoch
If your code affected any of the benchmark results, you should update the benchmark results table here in the same PR. See our guide on contributing to documentation for instructions on how to edit documentation.
These tutorials will help you get familiar with Monty. Topics and experiments increase in complexity as the tutorials progress, so we recommend going through them in order. You can also learn about contributing tutorials here.
The following tutorials are aimed at getting you familiar with the Monty code infrastructure, running experiments, and the general algorithm. They don't require you to write custom code, instead providing code for experiment configurations that you can follow along with.
- Running Your First Experiment: Demonstrates basic layout of an experiment config and outlines experimental training loops.
- Pretraining a Model: Shows how to configure an experiment for supervised pretraining.
- Running Inference with a Pretrained Model: Demonstrates how to load a pretrained model and use it to perform object and pose recognition.
- Unsupervised Continual Learning: Demonstrates how to use Monty for unsupervised continual learning to learn new objects from scratch.
- Multiple Learning Modules: Shows how to perform pretraining and inference using multiple sensor modules and multiple learning modules.
The following tutorials assume that you are already a bit more familiar with Monty. They are aimed at people who understand our approach and now want to customize it for their own experiments and applications. We provide code and configurations to follow along but the tutorials also provide instructions for writing your own custom code.
- Using Monty in a Custom Application [No Habitat Required]: Guides you through the process of letting Monty learn and perform inference in an environment of your choice. Talks about requirements, what needs to be defined, and which classes to customize.
- Using Monty for Robotics [No Habitat Required]: Building on the previous tutorial, this goes into depth on how to use Monty with physical sensors and actuators (aka robots 🤖).
title: Getting Started on Windows via WSL
description: How to get the code running using Windows Subsystem for Linux and Visual Studio Code
Note
This guide is based on the main Getting Started guide for macOS and Linux, which you may refer to as needed.
Warning
While the repository contains a uv.lock file, this is currently experimental and not supported. In the future this will change, but for now, avoid trying to use uv with this project.
Follow the Microsoft guide to install Windows Subsystem for Linux: https://learn.microsoft.com/en-us/windows/wsl/install-manual
It is important that you complete every step of their guide, otherwise you may encounter issues during the next steps below.
Regarding the Linux distribution, Ubuntu 24.04 LTS is recommended. Once installed, the Linux filesystem can directly be accessed via the "Linux" section in the Windows File Explorer, which should normally point to this path: \\wsl.localhost\Ubuntu-24.04\
Monty requires Conda to install its dependencies. For WSL, Miniconda is recommended. To download and install Miniconda, follow this guide using the Ubuntu terminal: https://www.anaconda.com/docs/getting-started/miniconda/install#linux-2
It is best practice (and required if you ever want to contribute code) to first fork our repository and then make any changes on your local fork. To do this, simply visit our repository and click the fork button as shown in the picture below. For more detailed instructions, see the GitHub documentation on forks.
Next, you need to clone the repository onto the system. To do that, enter the following command in the Ubuntu terminal, adjusting YOUR_GITHUB_USERNAME accordingly:
git clone https://github.com/YOUR_GITHUB_USERNAME/tbp.monty ~/tbp
For more details, see the GitHub documentation on cloning.
If you just forked and cloned this repository, you may skip this step, but any other time you get back to this code, you will want to synchronize it to work with the latest changes.
To make sure your fork is up to date with our repository you need to click on Sync fork -> Update branch in the GitHub interface. Afterwards, you will need to get the newest version of the code into your local copy by running git pull inside this folder.
You can also update your code using the terminal by calling git fetch upstream; git merge upstream/main. If you have not linked the upstream repository yet, you may first need to call:
git -C ~/tbp remote add upstream https://github.com/thousandbrainsproject/tbp.monty.git
Next, set up the conda environment for Monty. In the Ubuntu terminal, enter this:
cd ~/tbp
conda tos accept && conda env create && conda init && \
conda activate tbp.monty
This might take a few minutes or more to run, depending on your download speed.
Note
By default, Conda will activate the base environment when you open a new terminal. If you'd prefer to have the Monty environment active by default, enter this:
conda config --set default_activation_env tbp.monty
Also, if you want the ~/tbp folder active by default when opening the Ubuntu terminal, enter this:
echo '[ "$PWD" = "$HOME" ] && cd tbp' >> ~/.bashrc
Next, install libopengl0 to allow running Habitat-Sim:
sudo apt -y install libopengl0
Then, configure Linux to use the Windows GPU drivers directly:
echo "export GALLIUM_DRIVER=d3d12" >> ~/.bashrc && exec $SHELL
Warning
Don't install Linux GPU drivers in WSL. You don't need them, and NVIDIA even warns against installing them.
A lot of our current experiments are based on the YCB dataset, a dataset of 77 3D objects that we render in Habitat. To download the dataset, enter this:
python -m habitat_sim.utils.datasets_download --uids ycb --data-path ~/tbp/data/habitat
You can also get the Pretrained Models for inference testing:
mkdir -p ~/tbp/results/monty/pretrained_models/ && cd "$_"
curl -L https://tbp-pretrained-models-public-c9c24aef2e49b897.s3.us-east-2.amazonaws.com/tbp.monty/pretrained_ycb_v10.tgz | tar -xzf -
Optionally, you can get the Monty-Meets-World datasets for real-world testing:
mkdir -p ~/tbp/data/ && cd "$_"
curl -L https://tbp-data-public-5e789bd48e75350c.s3.us-east-2.amazonaws.com/tbp.monty/numenta_lab.tgz | tar -xzf -
curl -L https://tbp-data-public-5e789bd48e75350c.s3.us-east-2.amazonaws.com/tbp.monty/worldimages.tgz | tar -xzf -
If you did not save the pretrained models in the ~/tbp/results/monty/pretrained_models/ folder, you will need to set the MONTY_MODELS environment variable.
export MONTY_MODELS=/path/to/your/pretrained/models/dir
This path should point to the pretrained_models folder that contains the pretrained_ycb_v10 folder.
If you did not save the data (e.g., YCB objects) in the ~/tbp/data folder, you will need to set the MONTY_DATA environment variable.
export MONTY_DATA=/path/to/your/data
This path should point to the data folder, which contains data used for your experiments. Examples of data stored in this folder are the habitat folder containing the YCB objects, the worldimages folder containing camera images for the 'Monty Meets World' experiments, and the omniglot folder containing the Omniglot dataset.
If you would like to log your experiment results in a different folder than the default path (~/tbp/results/monty/) you need to set the MONTY_LOGS environment variable.
export MONTY_LOGS=/path/to/log/folder
We recommend not saving the wandb logs in the repository itself (the default save location). If you have already set the MONTY_LOGS variable, you can set the directory like this:
export WANDB_DIR=${MONTY_LOGS}/wandb
Install VS Code (the Windows version): https://code.visualstudio.com/download
Install the WSL extension: https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-wsl
Install the Python extension: https://marketplace.visualstudio.com/items?itemName=ms-python.python
From Ubuntu, initialize the settings file for the repo:
mkdir -p ~/tbp/.vscode/ && echo '{ "python.defaultInterpreterPath": "~/miniconda3/envs/tbp.monty/bin/python", "python.testing.pytestEnabled": true, "python.testing.pytestArgs": ["tests"] }' > ~/tbp/.vscode/settings.json
Still from Ubuntu, enter this to launch VS Code with Monty:
cd ~/tbp && code .
If you followed all the previous steps, you should now have VS Code open on the Monty project, ready to go. Try running the unit tests.
This will take some time, about 10 minutes on an 8-core i7-11700K, for example:
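If you prefer the terminal over the VS Code test explorer, running pytest directly should be equivalent; this is a minimal sketch assuming the pytest setup from the settings file above (tests living in the tests/ folder of the repository):
cd ~/tbp && python -m pytest tests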
Finally, let’s run a benchmark. You can do this in either the Ubuntu terminal or directly in the VS Code terminal. In the VS Code top menu, select Terminal > Open Terminal, then enter:
python benchmarks/run.py -e base_config_10distinctobj_dist_agent
In this case, it took a little over 5 minutes:
A good next step to get more familiar with our approach and the Monty code base is to go through our tutorials. They include follow-along code and detailed explanations on how Monty experiments are structured, how Monty can be configured in different ways, and what happens when you run a Monty experiment.
If you would like to contribute to the project, you can have a look at the many potential ways to contribute, particularly ways to contribute code.
You can also have a look at the capabilities of Monty and our project roadmap to get an idea of what Monty is currently capable of and what features our team is actively working on.
If you run into any issues or questions, please head over to our Discourse forum or open an Issue. We are always happy to help!
Thus far, we have been working with models that use a single agent with a single sensor which connects to a single learning module. In the context of vision, this is analogous to a small patch of retina that picks up a small region of the visual field and relays its information to its downstream target, a single cortical column in the primary visual cortex (V1). In human terms, this is like looking through a straw. While sufficient to recognize objects, one would have to make many successive eye movements to build up a picture of the environment. In reality, the retina contains many patches that tile the retinal surface, and they all send their information to their respective downstream target columns in V1. If, for example, a few neighboring retinal patches fall on different parts of the same object, then the object may be rapidly recognized once columns have communicated with each other about what they are seeing and where they are seeing it.
In this tutorial, we will show how Monty can be used to learn and recognize objects in a multiple sensor, multiple learning module setting. In this regime, we can perform object recognition with fewer steps than single-LM systems by allowing learning modules to communicate with one another through a process called voting. We will also introduce the distant agent, Monty's sensorimotor system that is most analogous to the human eye. Unlike the surface agent, the distant agent cannot move all around the object like a finger. Rather, it swivels left/right/up/down at a fixed distance from the object.
Note
Don't have the YCB Dataset Downloaded?
You can find instructions for downloading the YCB dataset here. Alternatively, you can run these experiments using the built-in Habitat primitives, such as capsule3DSolid and cubeSolid. Simply change the items in the object_names list.
In this section, we'll show how to perform supervised pretraining with a model containing six sensor modules, five of which are connected in a 1:1 fashion to five learning modules (the sixth sensor module is a view finder used for experiment setup and visualization and is not connected to a learning module). By default, the sensor modules are arranged in a cross shape, with four sensor modules displaced a small distance from the central sensor module, like so:
To follow along, open the benchmarks/configs/my_experiments.py file and paste the code snippets into it.
import os
from dataclasses import asdict
from benchmarks.configs.names import MyExperiments
from tbp.monty.frameworks.config_utils.config_args import (
FiveLMMontyConfig,
MontyArgs,
MotorSystemConfigNaiveScanSpiral,
PretrainLoggingConfig,
get_cube_face_and_corner_views_rotations,
)
from tbp.monty.frameworks.config_utils.make_env_interface_configs import (
EnvironmentInterfacePerObjectArgs,
PredefinedObjectInitializer,
SupervisedPretrainingExperimentArgs,
)
from tbp.monty.frameworks.config_utils.policy_setup_utils import (
make_naive_scan_policy_config,
)
from tbp.monty.frameworks.environments import embodied_data as ED
from tbp.monty.frameworks.experiments import (
MontySupervisedObjectPretrainingExperiment,
)
from tbp.monty.frameworks.models.motor_policies import NaiveScanPolicy
from tbp.monty.simulators.habitat.configs import (
FiveLMMountHabitatEnvInterfaceConfig,
)
# Specify directory where an output directory will be created.
project_dir = os.path.expanduser("~/tbp/results/monty/projects")
# Specify a name for the model.
model_name = "dist_agent_5lm_2obj"
# Specify the objects to train on and 14 unique object poses.
object_names = ["mug", "banana"]
train_rotations = get_cube_face_and_corner_views_rotations()
# The config dictionary for the pretraining experiment.
dist_agent_5lm_2obj_train = dict(
# Specify monty experiment class and its args.
# The MontySupervisedObjectPretrainingExperiment class will provide the model
# with object and pose labels for supervised pretraining.
experiment_class=MontySupervisedObjectPretrainingExperiment,
experiment_args=SupervisedPretrainingExperimentArgs(
n_train_epochs=len(train_rotations),
),
# Specify logging config.
logging_config=PretrainLoggingConfig(
output_dir=project_dir,
run_name=model_name,
),
    # Specify the Monty model. The FiveLMMontyConfig contains all of the sensor module
# configs, learning module configs, and connectivity matrices we need.
monty_config=FiveLMMontyConfig(
monty_args=MontyArgs(num_exploratory_steps=500),
motor_system_config=MotorSystemConfigNaiveScanSpiral(
motor_system_args=dict(
policy_class=NaiveScanPolicy,
policy_args=make_naive_scan_policy_config(step_size=5),
)
),
),
# Set up the environment and agent.
env_interface_config=FiveLMMountHabitatEnvInterfaceConfig(),
# Set up the training environment interface.
train_env_interface_class=ED.InformedEnvironmentInterface,
train_env_interface_args=EnvironmentInterfacePerObjectArgs(
object_names=object_names,
object_init_sampler=PredefinedObjectInitializer(rotations=train_rotations),
),
)
Finally, add your experiment to MyExperiments at the bottom of the file:
experiments = MyExperiments(
dist_agent_5lm_2obj_train=dist_agent_5lm_2obj_train,
)
CONFIGS = asdict(experiments)
If you've read the previous tutorials, much of this should look familiar. As in our pretraining tutorial, we've configured a MontySupervisedObjectPretrainingExperiment with a PretrainLoggingConfig. However, we are now using a built-in Monty model configuration called FiveLMMontyConfig that specifies everything we need to have five HabitatSM sensor modules, each connected to exactly one of five DisplacementGraphLM learning modules. FiveLMMontyConfig also specifies that each learning module connects to every other learning module through lateral voting connections. Note that the GraphLM learning modules used in previous tutorials would work fine here, but we're going with the default DisplacementGraphLM for convenience (this is a graph-based LM that also stores displacements between points, although these are generally not used during inference at present). To see how this is done, we can take a closer look at the FiveLMMontyConfig class, which contains the following lines:
sm_to_lm_matrix: List = field(
default_factory=lambda: [
[0],
[1],
[2],
[3],
[4],
], # View finder (sm5) not connected to lm
)
# For hierarchically connected LMs.
lm_to_lm_matrix: Optional[List] = None
# All LMs connect to each other
lm_to_lm_vote_matrix: List = field(
default_factory=lambda: [
[1, 2, 3, 4],
[0, 2, 3, 4],
[0, 1, 3, 4],
[0, 1, 2, 4],
[0, 1, 2, 3],
]
)
sm_to_lm_matrix is a list where the i-th entry indicates the learning module that receives input from the i-th sensor module. Note that the view finder, which is configured as sensor module 5, is not connected to any learning module since sm_to_lm_matrix[5] does not exist. Similarly, lm_to_lm_vote_matrix specifies which learning modules communicate with each other for voting during inference: lm_to_lm_vote_matrix[i] is a list of the learning module IDs that communicate with learning module i.
We have also specified that we want to use MotorSystemConfigNaiveScanSpiral for the motor system. This is a learning-focused motor policy that directs the agent to look across the object's surface in a spiraling motion. That way, we can ensure efficient coverage of the entire object (or at least what is visible from the current perspective) during learning.
Finally, we have also set the env_interface_config to FiveLMMountHabitatEnvInterfaceConfig. This specifies that we have five HabitatSM sensor modules (and a view finder) mounted onto a single distant agent. By default, the sensor modules cover three nearby regions and otherwise vary by resolution and zoom factor. For the exact specifications, see the FiveLMMountConfig in tbp/monty/frameworks/config_utils/make_env_interface_configs.py.
Before running this experiment, you will need to declare your experiment name as part of the MyExperiments dataclass in the benchmarks/configs/names.py file:
@dataclass
class MyExperiments:
dist_agent_5lm_2obj_train: dict
Then navigate to the benchmarks/ folder in a terminal and call the run.py script like so:
cd benchmarks
python run.py -e dist_agent_5lm_2obj_train
We will now specify an experiment config to perform inference.
To follow along, open the benchmarks/configs/my_experiments.py file and paste the code snippets into it.
import copy
import os
from dataclasses import asdict
import numpy as np
from benchmarks.configs.names import MyExperiments
from tbp.monty.frameworks.config_utils.config_args import (
EvalLoggingConfig,
FiveLMMontyConfig,
MontyArgs,
MotorSystemConfigInformedGoalStateDriven,
)
from tbp.monty.frameworks.config_utils.make_env_interface_configs import (
EnvironmentInterfacePerObjectArgs,
EvalExperimentArgs,
PredefinedObjectInitializer,
)
from tbp.monty.frameworks.environments import embodied_data as ED
from tbp.monty.frameworks.experiments import (
MontyObjectRecognitionExperiment,
)
from tbp.monty.frameworks.loggers.monty_handlers import BasicCSVStatsHandler
from tbp.monty.frameworks.models.evidence_matching.learning_module import (
EvidenceGraphLM
)
from tbp.monty.frameworks.models.evidence_matching.model import (
MontyForEvidenceGraphMatching
)
from tbp.monty.frameworks.models.goal_state_generation import (
EvidenceGoalStateGenerator,
)
from tbp.monty.simulators.habitat.configs import (
FiveLMMountHabitatEnvInterfaceConfig,
)
"""
Basic Info
"""
# Specify directory where an output directory will be created.
project_dir = os.path.expanduser("~/tbp/results/monty/projects")
# Specify a name for the model.
model_name = "dist_agent_5lm_2obj"
object_names = ["mug", "banana"]
test_rotations = [np.array([0, 15, 30])] # A previously unseen rotation of the objects
model_path = os.path.join(
project_dir,
model_name,
"pretrained",
)
As usual, we set up our imports, save/load paths, and specify which objects to use and which rotations they'll be presented in. For simplicity, we'll only perform inference on each of the two objects once, but you could easily test more by adding more rotations to the test_rotations list.
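For instance, to test each object in three rotations instead of one, you could write (the rotation values here are arbitrary examples):
test_rotations = [
    np.array([0, 15, 30]),
    np.array([45, 60, 75]),
    np.array([120, 200, 40]),
]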
Now we specify the learning module config. For simplicity, we define one learning module config and copy it to reuse settings across learning modules. We need only make one change to each copy so that the tolerances reference the sensor ID connected to the learning module.
"""
Learning Module Configs
"""
# Create a template config that we'll make copies of.
evidence_lm_config = dict(
learning_module_class=EvidenceGraphLM,
learning_module_args=dict(
max_match_distance=0.01, # =1cm
feature_weights={
"patch": {
# Weighting saturation and value less since these might change under
# different lighting conditions.
"hsv": np.array([1, 0.5, 0.5]),
}
},
# Use this to update all hypotheses > 80% of the max hypothesis evidence
evidence_threshold_config="80%",
x_percent_threshold=20,
gsg_class=EvidenceGoalStateGenerator,
gsg_args=dict(
goal_tolerances=dict(
location=0.015, # distance in meters
), # Tolerance(s) when determining goal-state success
min_post_goal_success_steps=5, # Number of necessary steps for a hypothesis-testing action to be considered
),
hypotheses_updater_args=dict(
max_nneighbors=10,
)
),
)
# We'll also reuse these tolerances, so we specify them here.
tolerance_values = {
"hsv": np.array([0.1, 0.2, 0.2]),
"principal_curvatures_log": np.ones(2),
}
# Now we make 5 copies of the template config, each with the tolerances specified for
# one of the five sensor modules.
learning_module_configs = {}
for i in range(5):
lm = copy.deepcopy(evidence_lm_config)
lm["learning_module_args"]["tolerances"] = {f"patch_{i}": tolerance_values}
learning_module_configs[f"learning_module_{i}"] = lmNow we can create the final complete config dictionary.
# The config dictionary for the evaluation experiment.
dist_agent_5lm_2obj_eval = dict(
# Specify monty experiment class and its args.
experiment_class=MontyObjectRecognitionExperiment,
experiment_args=EvalExperimentArgs(
model_name_or_path=model_path,
n_eval_epochs=len(test_rotations),
min_lms_match=3, # Terminate when 3 learning modules make a decision.
),
# Specify logging config.
logging_config=EvalLoggingConfig(
output_dir=os.path.join(project_dir, model_name),
run_name="eval",
monty_handlers=[BasicCSVStatsHandler],
wandb_handlers=[],
),
# Specify the Monty model. The FiveLMMontyConfig contains all of the
# sensor module configs and connectivity matrices. We will specify
# evidence-based learning modules and MontyForEvidenceGraphMatching which
# facilitates voting between evidence-based learning modules.
monty_config=FiveLMMontyConfig(
monty_args=MontyArgs(min_eval_steps=20),
monty_class=MontyForEvidenceGraphMatching,
learning_module_configs=learning_module_configs,
motor_system_config=MotorSystemConfigInformedGoalStateDriven(),
),
# Set up the environment and agent.
env_interface_config=FiveLMMountHabitatEnvInterfaceConfig(),
# Set up the evaluation environment interface.
eval_env_interface_class=ED.InformedEnvironmentInterface,
eval_env_interface_args=EnvironmentInterfacePerObjectArgs(
object_names=object_names,
object_init_sampler=PredefinedObjectInitializer(rotations=test_rotations),
),
)
Finally, add your experiment to MyExperiments at the bottom of the file:
experiments = MyExperiments(
dist_agent_5lm_2obj_eval=dist_agent_5lm_2obj_eval,
)
CONFIGS = asdict(experiments)
Once again, declare your experiment name as part of the MyExperiments dataclass in the benchmarks/configs/names.py file:
@dataclass
class MyExperiments:
dist_agent_5lm_2obj_eval: dict
Finally, run the experiment from the benchmarks/ folder:
python run.py -e dist_agent_5lm_2obj_eval
Let's have a look at part of the eval_stats.csv file located at ~/tbp/results/monty/projects/dist_agent_5lm_2obj/eval/eval_stats.csv.
Each row corresponds to one learning module during one episode, so each episode now occupies a 5-row block in the table. On the far right, the primary_target_object column indicates the object being recognized. On the far left, the primary_performance column indicates the learning module's success. In episode 0, all LMs correctly decided that the mug was the object being shown. In episode 1, all LMs except LM_1 terminated with correct, while LM_1 terminated with correct_mlh (correct most-likely hypothesis). In short, this means that LM_1 had not yet met its evidence thresholds to make a decision, but the right object was its leading candidate. Had LM_1 been able to continue observing the object, it may well have met the threshold needed to make a final decision. However, the episode was terminated as soon as three learning modules met the evidence threshold needed to make a decision. We can require that any number of learning modules meet their evidence thresholds by changing the min_lms_match parameter supplied to EvalExperimentArgs. See here for a more thorough discussion of how learning modules reach terminal conditions and here to learn about how voting works with the evidence LM.
As in our benchmark experiments, we have min_lms_match set to 3 here. Setting it higher requires more steps but reduces the likelihood of incorrect classification. You can try adjusting min_lms_match and see what effect it has on the number of steps required to reach a decision. In all cases, however, Monty should reach a decision more quickly with five sensor modules than with one. This ability to reach quicker decisions through voting is central to Monty. In our benchmark experiments, 5-LM models perform inference in roughly 1/3 of the steps needed for a single-LM distant-agent model and with fewer instances of incorrect classification.
Lastly, note that num_steps is not the same for all learning modules in an episode. This is because one or more of the sensors can sometimes be aimed off to the side of the object. In this case, the off-object sensor module won't relay information downstream, and so its corresponding learning module will skip a step. (See here for more information about steps.) For example, we see that LM_1 in episode 1 only takes 8 steps while the others take 20-30. Since the sensor module connected to LM_1 was positioned higher than the others, we can surmise that the sensor modules were aimed relatively high on the object, causing the sensor module connected to LM_1 to be off-object for many of the steps.
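Since eval_stats is a pandas DataFrame when loaded via load_stats, you can also summarize these outcomes directly. A small sketch, using only the columns discussed above:
# Count LM outcomes (correct, correct_mlh, ...) across all episodes.
print(eval_stats["primary_performance"].value_counts())
# Average number of steps per target object.
print(eval_stats.groupby("primary_target_object")["num_steps"].mean())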
Now you've seen how to set up and run multi-LM models for both pretraining and evaluation. At present, Monty only supports distant agents with multi-LM models because the current infrastructure doesn't support multiple independently moving agents. We plan to support multiple surface-agent systems in the future.
During pretraining, each learning module learns its own object models independently of the other LMs. To visualize the models learned by each LM, create a script with the code below. The location and name of the script are unimportant, as long as it can find and import Monty.
import os
import matplotlib.pyplot as plt
import torch
from tbp.monty.frameworks.utils.plot_utils_dev import plot_graph
# Get path to pretrained model
project_dir = os.path.expanduser("~/tbp/results/monty/projects")
model_name = "dist_agent_5lm_2obj"
model_path = os.path.join(project_dir, model_name, "pretrained/model.pt")
state_dict = torch.load(model_path)
fig = plt.figure(figsize=(8, 3))
for lm_id in range(5):
ax = fig.add_subplot(1, 5, lm_id + 1, projection="3d")
graph = state_dict["lm_dict"][lm_id]["graph_memory"]["mug"][f"patch_{lm_id}"]
plot_graph(graph, ax=ax)
ax.view_init(-65, 0, 0)
ax.set_title(f"LM {lm_id}")
fig.suptitle("Mug Object Models")
fig.tight_layout()
plt.show()
After running the script, you should see the following:
There are minor differences in the object models due to the different views each sensor module relayed to its respective learning module, but each should contain a fairly complete representation of the mug.
This tutorial demonstrates how to configure and run Monty experiments for pretraining. In the next tutorial, we show how to load our pretrained model and use it to perform inference. Though Monty is designed for continual learning and does not require separate training and evaluation modes, this set of experiments is useful for understanding many of our benchmark experiments.
The pretraining differs from Monty's default learning setup in that it is supervised. Under normal conditions, Monty learns by first trying to recognize an object, and then updating its models depending on the outcome of this recognition (unsupervised). Here, we provide the object name and pose to the model directly, so there is no inference phase required. This provides an experimental condition where we can ensure that the model learns the correct models for each of the objects, and we can focus on the inference phase afterward. Naturally, unsupervised learning provides a more challenging but also more naturalistic learning condition, and we will cover this condition later.
Our model will have one surface agent connected to one sensor module connected to one learning module. For simplicity and speed, we will only train/test on two objects in the YCB dataset.
Note
Don't have the YCB Dataset Downloaded?
You can find instructions for downloading the YCB dataset here. Alternatively, you can run these experiments using the built-in Habitat primitives, such as capsule3DSolid and cubeSolid. Simply change the items in the object_names list.
Monty experiments are defined using a nested dictionary. These dictionaries define the experiment class and associated simulation parameters, logging configs, the Monty model (which includes sensor modules, learning modules, and a motor system), and the environment interface. This is the basic structure of a complete experiment config, along with the expected types:
- experiment_class: MontyExperiment. Manages the highest-level calls to the environment and Monty model.
- experiment_args: ExperimentArgs. Arguments supplied to the experiment class.
- logging_config: LoggingConfig. Specifies which loggers should be used.
- monty_config: MontyConfig. Contains:
  - monty_class: Monty. The type of Monty model to use, e.g. for evidence-based graph matching: MontyForEvidenceGraphMatching.
  - monty_args: MontyArgs. Arguments supplied to the Monty class.
  - sensor_module_configs: Mapping[str, Mapping]
  - learning_module_configs: Mapping[str, Mapping]
  - motor_system_config: dataclass (e.g., MotorSystemConfigCurvatureInformedSurface)
  - sm_to_agent_dict: mapping of which sensors connect to which sensor modules.
  - sm_to_lm_matrix: mapping of which sensor modules connect to which learning modules.
  - lm_to_lm_matrix: hierarchical connectivity between learning modules.
  - lm_to_lm_vote_matrix: lateral connectivity between learning modules.
- env_interface_config: dataclass. Specifies environment interface-related args, including the environment that the interface wraps (env_init_func), arguments for initializing this environment such as the agent and sensor configuration (env_init_args), and optional transforms applied before information reaches a sensor module. For an example, see SurfaceViewFinderMountHabitatEnvInterfaceConfig.
- train_env_interface_class: EnvironmentInterface
- train_env_interface_args: Specifies how the interface should interact with the environment, for instance, which objects should be shown in which episodes and in which orientations and locations. For example, see EnvironmentInterfacePerObjectArgs.
- eval_env_interface_class: EnvironmentInterface
- eval_env_interface_args: Same purpose as train_env_interface_args but allows for presenting Monty with different conditions between training and evaluation. For example, see EnvironmentInterfacePerObjectArgs.
Most configs come in pairs: a class to instantiate and arguments to instantiate it with. A set of arguments is specified as a Python data class, and Monty has many data classes that simplify setup by defining different sets of default parameters.
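Schematically, a complete experiment config is therefore a dictionary like the following skeleton (placeholders only; the classes named in the comments are the ones used in the full, working example below):
my_experiment = dict(
    experiment_class=...,           # e.g. MontySupervisedObjectPretrainingExperiment
    experiment_args=...,            # e.g. SupervisedPretrainingExperimentArgs(...)
    logging_config=...,             # e.g. PretrainLoggingConfig(...)
    monty_config=...,               # e.g. PatchAndViewMontyConfig(...) with nested SM/LM/motor configs
    env_interface_config=...,       # e.g. SurfaceViewFinderMountHabitatEnvInterfaceConfig()
    train_env_interface_class=...,  # e.g. ED.InformedEnvironmentInterface
    train_env_interface_args=...,   # e.g. EnvironmentInterfacePerObjectArgs(...)
)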
To follow along, open the benchmarks/configs/my_experiments.py file and paste the code snippets into it. Let's set up the training experiment. First, we import everything we need and define names and paths.
import os
from dataclasses import asdict
from benchmarks.configs.names import MyExperiments
from tbp.monty.frameworks.config_utils.config_args import (
MontyArgs,
MotorSystemConfigCurvatureInformedSurface,
PatchAndViewMontyConfig,
PretrainLoggingConfig,
get_cube_face_and_corner_views_rotations,
)
from tbp.monty.frameworks.config_utils.make_env_interface_configs import (
EnvironmentInterfacePerObjectArgs,
PredefinedObjectInitializer,
SupervisedPretrainingExperimentArgs,
)
from tbp.monty.frameworks.environments import embodied_data as ED
from tbp.monty.frameworks.experiments import (
MontySupervisedObjectPretrainingExperiment,
)
from tbp.monty.frameworks.models.graph_matching import GraphLM
from tbp.monty.frameworks.models.sensor_modules import (
HabitatSM,
Probe,
)
from tbp.monty.simulators.habitat.configs import (
SurfaceViewFinderMountHabitatEnvInterfaceConfig,
)
"""
Basic setup
-----------
"""
# Specify directory where an output directory will be created.
project_dir = os.path.expanduser("~/tbp/results/monty/projects")
# Specify a name for the model.
model_name = "surf_agent_1lm_2obj"Note
Where Logs and Models are Saved
Loggers have output_dir and run_name parameters, and output will typically be saved to OUTPUT_DIR/RUN_NAME. The PretrainLoggingConfig class is an exception to this as it stores output to OUTPUT_DIR/RUN_NAME/pretrained. We will set up the pretraining logger using project_dir as the output_dir and model_name as the run_name, so the final model will be stored at ~/tbp/results/monty/projects/surf_agent_1lm_2obj/pretrained.
Inference logs will be saved at ~/tbp/results/monty/projects/surf_agent_1lm_2obj/eval.
Next, we specify which objects from the dataset the model will train on, including the rotations in which the objects will be presented. The following code specifies two objects ("mug" and "banana") and 14 unique rotations, which means that both the mug and the banana will be shown 14 times, each time in a different rotation. During each of the 28 episodes, the sensors will move over the respective object and collect multiple observations to update the object's model.
"""
Training
----------------------------------------------------------------------------------------
"""
# Here we specify which objects to learn. 'mug' and 'banana' come from the YCB dataset.
# If you don't have the YCB dataset, replace with names from habitat (e.g.,
# 'capsule3DSolid', 'cubeSolid', etc.).
object_names = ["mug", "banana"]
# Get predefined object rotations that give good views of the object from 14 angles.
train_rotations = get_cube_face_and_corner_views_rotations()
The function get_cube_face_and_corner_views_rotations() is used in our pretraining and many of our benchmark experiments since the rotations it returns provide a good set of views from all around the object. Its name comes from picturing an imaginary cube surrounding the object. If we look at the object from each of the cube's faces, we get 6 unique views that typically cover most of the object's surface. We can also look at the object from each of the cube's 8 corners, which provides an extra set of views that help fill in any gaps. The 14 rotations provided by get_cube_face_and_corner_views_rotations will rotate the object as if an observer were looking at it from each of the cube's faces and corners, like so:
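You can also print the rotations to inspect them directly; each entry is assumed to be an xyz rotation of the same form as the rotations passed to PredefinedObjectInitializer:
# Print the 14 predefined training rotations.
for i, rotation in enumerate(train_rotations):
    print(f"View {i}: {rotation}")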
Now we define the entire nested dictionary that specifies one complete Monty experiment:
# The config dictionary for the pretraining experiment.
surf_agent_2obj_train = dict(
# Specify monty experiment and its args.
# The MontySupervisedObjectPretrainingExperiment class will provide the model
# with object and pose labels for supervised pretraining.
experiment_class=MontySupervisedObjectPretrainingExperiment,
experiment_args=SupervisedPretrainingExperimentArgs(
n_train_epochs=len(train_rotations),
),
# Specify logging config.
logging_config=PretrainLoggingConfig(
output_dir=project_dir,
run_name=model_name,
wandb_handlers=[],
),
# Specify the Monty config.
monty_config=PatchAndViewMontyConfig(
monty_args=MontyArgs(num_exploratory_steps=500),
# sensory module configs: one surface patch for training (sensor_module_0),
# and one view-finder for initializing each episode and logging
# (sensor_module_1).
sensor_module_configs=dict(
sensor_module_0=dict(
sensor_module_class=HabitatSM,
sensor_module_args=dict(
is_surface_sm=True,
sensor_module_id="patch",
# a list of features that the SM will extract and send to the LM
features=[
"pose_vectors",
"pose_fully_defined",
"on_object",
"object_coverage",
"rgba",
"hsv",
"min_depth",
"mean_depth",
"principal_curvatures",
"principal_curvatures_log",
"gaussian_curvature",
"mean_curvature",
"gaussian_curvature_sc",
"mean_curvature_sc",
],
save_raw_obs=False,
),
),
sensor_module_1=dict(
sensor_module_class=Probe,
sensor_module_args=dict(
sensor_module_id="view_finder",
save_raw_obs=False,
),
),
),
# learning module config: 1 graph learning module.
learning_module_configs=dict(
learning_module_0=dict(
learning_module_class=GraphLM,
learning_module_args=dict(), # Use default LM args
)
),
# Motor system config: class specific to surface agent.
motor_system_config=MotorSystemConfigCurvatureInformedSurface(),
),
# Set up the environment and agent
env_interface_config=SurfaceViewFinderMountHabitatEnvInterfaceConfig(),
train_env_interface_class=ED.InformedEnvironmentInterface,
train_env_interface_args=EnvironmentInterfacePerObjectArgs(
object_names=object_names,
object_init_sampler=PredefinedObjectInitializer(rotations=train_rotations),
),
)
Here, we explicitly specified most parameters in config classes for transparency. The remaining parameters (e.g., sm_to_lm_matrix) aren't supplied since PatchAndViewMontyConfig's defaults work fine here. If you use a different number of SMs or LMs, or want a custom connectivity between them, you will have to specify those as well.
Briefly, we specified our experiment class and the number of epochs to run. We also configured a logger and a training environment interface to initialize our objects at different orientations for each episode. monty_config is a nested config that describes the complete sensorimotor modeling system. Here is a short breakdown of its components:
- PatchAndViewMontyConfig: the top-level Monty config object. It specifies that we will have a sensor patch and an additional view finder as inputs to the system. It also specifies the routing matrices between sensors, SMs, and LMs (using defaults in this simple setup).
- monty_args: a MontyArgs object specifying that we want 500 exploratory steps per episode.
- sensor_module_configs: a dictionary specifying sensor module classes and arguments. These dictionaries specify that sensor_module_0 will be a HabitatSM with is_surface_sm=True (a small sensory patch for a surface agent). The sensor module will extract the given list of features for each patch. We won't save raw observations here since it is memory-intensive and only required for detailed logging/plotting. sensor_module_1 will be a Probe which we can use for logging. We could also store raw observations from the view finder for later visualization/analysis if needed. This sensor module is not connected to a learning module and, therefore, is not used for learning. It is called view_finder since it helps initialize each episode on the object.
- learning_module_configs: a dictionary specifying the learning module class and arguments. This dictionary specifies that learning_module_0 will be a GraphLM that constructs a graph of the object being explored.
- motor_system_config: a MotorSystemConfigCurvatureInformedSurface config object that specifies the motor policy class to use. The policy here will move orthogonal to the surface of the object with a preference for following sensed principal curvatures. When doing pretraining with the distant agent, the MotorSystemConfigNaiveScanSpiral policy is recommended since it ensures even coverage of the object from the available viewpoint.
To get an idea of what each sensor module sees and the information passed on to the learning module, check out the documentation on observations, transforms, and sensor modules. To learn more about how learning modules construct object graphs from sensor output, refer to the graph building documentation.
Finally, add your experiment to MyExperiments at the bottom of the file:
experiments = MyExperiments(
surf_agent_2obj_train=surf_agent_2obj_train,
)
CONFIGS = asdict(experiments)
Next, you will need to declare your experiment name as part of the MyExperiments dataclass in the benchmarks/configs/names.py file:
@dataclass
class MyExperiments:
surf_agent_2obj_train: dict
To run this experiment, navigate to the benchmarks/ folder in a terminal and call the run.py script with the experiment name as the -e argument.
cd benchmarks
python run.py -e surf_agent_2obj_train
This will take a few minutes to complete, and then you can inspect and visualize the learned models. To do so, create a script and paste in the following code. The location and name of the script are unimportant, but we called it pretraining_tutorial_analysis.py and placed it outside of the repository at ~/monty_scripts.
import os
import matplotlib.pyplot as plt
from tbp.monty.frameworks.utils.logging_utils import load_stats
from tbp.monty.frameworks.utils.plot_utils_dev import plot_graph
# Specify where pretraining data is stored.
exp_path = os.path.expanduser("~/tbp/results/monty/projects/surf_agent_1lm_2obj")
pretrained_dict = os.path.join(exp_path, "pretrained")
train_stats, eval_stats, detailed_stats, lm_models = load_stats(
exp_path,
load_train=False, # doesn't load train csv
load_eval=False, # doesn't try to load eval csv
load_detailed=False, # doesn't load detailed json output
load_models=True, # loads models
pretrained_dict=pretrained_dict,
)
# Visualize the mug graph from the pretrained graphs loaded above from
# pretrained_dict. Replace 'mug' with 'banana' to plot the banana graph.
plot_graph(lm_models["pretrained"][0]["mug"]["patch"], rotation=120)
plt.show()
Replace "mug" with "banana" in the second-to-last line to visualize the banana's graph. After running the script, you should see a graph of the mug/banana.
See logging and analysis for more detailed information about experiment logs and how to work with them. You can now move on to part two of this tutorial where we load our pretrained model and use it for inference.
If you have your own repository and want to run your own experiment or a benchmark, you do not need to replicate the tbp.monty benchmarks setup.
Note
We have a tbp.monty_project_template template repository, so that you can quickly use tbp.monty for your project, prototype, or paper.
You have the option of running everything from a single script file. The general setup is:
from tbp.monty.frameworks.run_env import setup_env
# call setup_env() to initialize environment used by
# tbp.monty configuration and runtime.
setup_env()
# imports from tbp.monty for use in your configuration
# import run or run_parallel to run the experiment
from tbp.monty.frameworks.run import run # noqa: E402
from tbp.monty.frameworks.run_parallel import run_parallel # noqa: E402
experiment_config = ...  # your configuration
run(experiment_config)
# or
run_parallel(experiment_config)
A more filled-out example:
import os
from tbp.monty.frameworks.run_env import setup_env
setup_env()
from tbp.monty.frameworks.config_utils.config_args import ( # noqa: E402
LoggingConfig,
PatchAndViewMontyConfig,
)
from tbp.monty.frameworks.config_utils.make_env_interface_configs import ( # noqa: E402
SupervisedPretrainingExperimentArgs,
get_env_interface_per_object_by_idx,
)
from tbp.monty.frameworks.environments import embodied_data as ED # noqa: E402
from tbp.monty.frameworks.experiments.pretraining_experiments import ( # noqa: E402
MontySupervisedObjectPretrainingExperiment,
)
from tbp.monty.frameworks.run import run # noqa: E402
from tbp.monty.simulators.habitat.configs import ( # noqa: E402
PatchViewFinderMountHabitatEnvInterfaceConfig,
)
first_experiment = dict(
experiment_class=MontySupervisedObjectPretrainingExperiment,
logging_config=LoggingConfig(
log_parallel_wandb=False,
run_name="test",
output_dir=os.path.expanduser(
os.path.join(os.getenv("MONTY_LOGS"), "projects/monty_runs/test")
),
),
experiment_args=SupervisedPretrainingExperimentArgs(
do_eval=False,
max_train_steps=1,
n_train_epochs=1,
),
monty_config=PatchAndViewMontyConfig(),
env_interface_config=PatchViewFinderMountHabitatEnvInterfaceConfig(),
train_env_interface_class=ED.EnvironmentInterfacePerObject,
train_env_interface_args=get_env_interface_per_object_by_idx(start=0, stop=1),
)
run(first_experiment)
This tutorial is a follow-up to our tutorial on pretraining a model. Here we will load the pretrained model and use it to perform object recognition under noisy conditions and at several arbitrary object rotations.
Note
The first part of this tutorial must be completed for the code in this tutorial to run.
To follow along, open the benchmarks/configs/my_experiments.py file and paste the code snippets into it.
import os
from dataclasses import asdict
import numpy as np
from benchmarks.configs.names import MyExperiments
from tbp.monty.frameworks.config_utils.config_args import (
EvalLoggingConfig,
MontyArgs,
MotorSystemConfigCurInformedSurfaceGoalStateDriven,
PatchAndViewSOTAMontyConfig,
)
from tbp.monty.frameworks.config_utils.make_env_interface_configs import (
EnvironmentInterfacePerObjectArgs,
EvalExperimentArgs,
PredefinedObjectInitializer,
)
from tbp.monty.frameworks.environments import embodied_data as ED
from tbp.monty.frameworks.experiments import (
MontyObjectRecognitionExperiment,
)
from tbp.monty.frameworks.models.evidence_matching.learning_module import (
EvidenceGraphLM
)
from tbp.monty.frameworks.models.goal_state_generation import (
EvidenceGoalStateGenerator,
)
from tbp.monty.frameworks.models.sensor_modules import (
HabitatSM,
Probe,
)
from tbp.monty.simulators.habitat.configs import (
SurfaceViewFinderMountHabitatEnvInterfaceConfig,
)
"""
Basic setup
-----------
"""
# Specify the directory where an output directory will be created.
project_dir = os.path.expanduser("~/tbp/results/monty/projects")
# Specify the model name. This needs to be the same name as used for pretraining.
model_name = "surf_agent_1lm_2obj"
# Where to find the pretrained model.
model_path = os.path.join(project_dir, model_name, "pretrained")
# Where to save eval logs.
output_dir = os.path.join(project_dir, model_name)
run_name = "eval"Now we specify that we want to test the model on "mug" and "banana", and that we want the objects to be rotated a few different ways.
# Specify objects to test and the rotations in which they'll be presented.
object_names = ["mug", "banana"]
test_rotations = [
np.array([0.0, 15.0, 30.0]),
np.array([7.0, 77.0, 2.0]),
np.array([81.0, 33.0, 90.0]),
]
Since this config is going to be a bit more complex, we will build it up in pieces. Here is the configuration for the sensor modules.
# Let's add some noise to the sensor module outputs to make the task more challenging.
sensor_noise_params = dict(
features=dict(
pose_vectors=2, # rotate by random degrees along xyz
hsv=np.array([0.1, 0.2, 0.2]), # add noise to each channel (the values here specify std. deviation of gaussian for each channel individually)
principal_curvatures_log=0.1,
pose_fully_defined=0.01, # flip bool in 1% of cases
),
location=0.002, # add gaussian noise with 0.002 std (0.2cm)
)
sensor_module_0 = dict(
sensor_module_class=HabitatSM,
sensor_module_args=dict(
sensor_module_id="patch",
# Features that will be extracted and sent to LM
# note: don't have to be all the features extracted during pretraining.
features=[
"pose_vectors",
"pose_fully_defined",
"on_object",
"object_coverage",
"min_depth",
"mean_depth",
"hsv",
"principal_curvatures",
"principal_curvatures_log",
],
save_raw_obs=False,
# HabitatSM will only send an observation to the LM if features or location
# changed more than these amounts.
delta_thresholds={
"on_object": 0,
"n_steps": 20,
"hsv": [0.1, 0.1, 0.1],
"pose_vectors": [np.pi / 4, np.pi * 2, np.pi * 2],
"principal_curvatures_log": [2, 2],
"distance": 0.01,
},
is_surface_sm=True, # for surface agent
noise_params=sensor_noise_params,
),
)
sensor_module_1 = dict(
sensor_module_class=Probe,
sensor_module_args=dict(
sensor_module_id="view_finder",
save_raw_obs=False,
),
)
sensor_module_configs = dict(
sensor_module_0=sensor_module_0,
sensor_module_1=sensor_module_1,
)
There are two main differences between this config and the pretraining sensor module config. First, we are adding noise to the sensor patch, so we define noise parameters and add them to sensor_module_0's dictionary. Second, we're using the delta_thresholds parameter so that an observation is only sent to the learning module if the features have changed significantly. Note that HabitatSM can be used with either a surface or a distant agent; is_surface_sm should be set accordingly.
For the learning module, we specify:
# Tolerances within which features must match stored values in order to add evidence
# to a hypothesis.
tolerances = {
"patch": {
"hsv": np.array([0.1, 0.2, 0.2]),
"principal_curvatures_log": np.ones(2),
}
}
# Features where weight is not specified default to 1.
feature_weights = {
"patch": {
# Weighting saturation and value less since these might change under different
# lighting conditions.
"hsv": np.array([1, 0.5, 0.5]),
}
}
learning_module_0 = dict(
learning_module_class=EvidenceGraphLM,
learning_module_args=dict(
# Search the model in a radius of 1cm from the hypothesized location on the model.
max_match_distance=0.01, # =1cm
tolerances=tolerances,
feature_weights=feature_weights,
# Most likely hypothesis needs to have 20% more evidence than the others to
# be considered certain enough to trigger a terminal condition (match).
x_percent_threshold=20,
# Update all hypotheses with evidence > 80% of the max hypothesis evidence
evidence_threshold_config="80%",
# Config for goal state generator of LM which is used for model-based action
# suggestions, such as hypothesis-testing actions.
gsg_class=EvidenceGoalStateGenerator,
gsg_args=dict(
# Tolerance(s) when determining goal-state success
goal_tolerances=dict(
location=0.015, # distance in meters
),
# Number of necessary steps for a hypothesis-testing action to be considered
min_post_goal_success_steps=5,
),
hypotheses_updater_args=dict(
# Look at features associated with (at most) the 10 closest learned points.
max_nneighbors=10,
)
),
)
learning_module_configs = dict(learning_module_0=learning_module_0)

Since this learning module will be performing graph matching, we want to specify further parameters that define various thresholds and weights to be given to different features. We're also using the EvidenceGraphLM rather than the GraphLM; the EvidenceGraphLM keeps a continuous evidence count for all its hypotheses and is currently the best-performing LM in this codebase.
We then integrate these sensor and learning module configs into the overall experiment config.
# The config dictionary for the evaluation experiment.
surf_agent_2obj_eval = dict(
# Set up experiment
experiment_class=MontyObjectRecognitionExperiment,
experiment_args=EvalExperimentArgs(
model_name_or_path=model_path, # load the pre-trained models from this path
n_eval_epochs=len(test_rotations),
max_total_steps=5000,
show_sensor_output=True, # live visualization of Monty's observations and MLH
),
logging_config=EvalLoggingConfig(
output_dir=output_dir,
run_name=run_name,
wandb_handlers=[], # remove this line if you, additionally, want to log to WandB.
),
# Set up monty, including LM, SM, and motor system.
monty_config=PatchAndViewSOTAMontyConfig(
monty_args=MontyArgs(min_eval_steps=20),
sensor_module_configs=sensor_module_configs,
learning_module_configs=learning_module_configs,
motor_system_config=MotorSystemConfigCurInformedSurfaceGoalStateDriven(),
),
# Set up environment/data
env_interface_config=SurfaceViewFinderMountHabitatEnvInterfaceConfig(),
eval_env_interface_class=ED.InformedEnvironmentInterface,
eval_env_interface_args=EnvironmentInterfacePerObjectArgs(
object_names=object_names,
object_init_sampler=PredefinedObjectInitializer(rotations=test_rotations),
),
# Doesn't get used, but currently needs to be set anyways.
train_env_interface_class=ED.InformedEnvironmentInterface,
train_env_interface_args=EnvironmentInterfacePerObjectArgs(
object_names=object_names,
object_init_sampler=PredefinedObjectInitializer(rotations=test_rotations),
),
)

Note that we have changed the Monty experiment class and the logging config. We also opted for a policy whereby learning modules generate actions to test hypotheses by producing "goal states" for the low-level motor system (MotorSystemConfigCurInformedSurfaceGoalStateDriven). Additionally, we are now initializing the objects at test_rotations instead of train_rotations.
Finally, add your experiment to MyExperiments at the bottom of the file:
experiments = MyExperiments(
surf_agent_2obj_eval=surf_agent_2obj_eval,
)
CONFIGS = asdict(experiments)

Next you will need to declare your experiment name as part of the MyExperiments dataclass in the benchmarks/configs/names.py file:
@dataclass
class MyExperiments:
    surf_agent_2obj_eval: dict

To run the experiment, navigate to the benchmarks/ folder and call the run.py script with an experiment name as the -e argument.
cd benchmarks
python run.py -e surf_agent_2obj_eval

Once the run is complete, you can inspect the inference logs located in ~/tbp/results/monty/projects/surf_agent_1lm_2obj/eval. Since EvalLoggingConfig includes a CSV logger, you should be able to open eval_stats.csv and find 6 rows (one for each episode) detailing whether the object was correctly identified, the number of steps required in each episode, etc.
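If you prefer a programmatic check over opening the file by hand, a small pandas sketch like the one below can summarize the results. The primary_performance column name is taken from the statistics described later in these tutorials, so treat it as an assumption here.

import os
import pandas as pd

eval_dir = os.path.expanduser("~/tbp/results/monty/projects/surf_agent_1lm_2obj/eval")
stats = pd.read_csv(os.path.join(eval_dir, "eval_stats.csv"))
print(len(stats))  # expect 6 rows: 2 objects x 3 rotations
print(stats["primary_performance"].value_counts())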
You now know how to pretrain a model and use it to perform inference. In our next tutorial, we will demonstrate how to use Monty for unsupervised continual learning.
In this tutorial we will introduce the basic mechanics of Monty experiment configs, how to run them, and what happens during the execution of a Monty experiment. Since we will focus mainly on the execution of an experiment, we'll configure and run the simplest possible experiment and walk through it step-by-step. Please have a look at the next tutorials for more concrete examples of running the code with our current graph learning approach.
Note
Don't have the YCB Dataset Downloaded?
Most of our tutorials require the YCB dataset, including this one. Please follow the instructions on downloading it here.
Note
The instructions below assume you'll be running an experiment within the checked-out tbp.monty repository. This is the recommended way to start. Once you are familiar with Monty, if you'd rather set up your experiment in your own repository, take a look at Running An Experiment From A Different Repository.
To follow along, copy this code into the benchmarks/configs/my_experiments.py file.
from dataclasses import asdict
from benchmarks.configs.names import MyExperiments
from tbp.monty.frameworks.config_utils.config_args import (
LoggingConfig,
PatchAndViewMontyConfig,
)
from tbp.monty.frameworks.config_utils.make_env_interface_configs import (
get_env_interface_per_object_by_idx,
SupervisedPretrainingExperimentArgs,
)
from tbp.monty.frameworks.environments import embodied_data as ED
from tbp.monty.frameworks.experiments.pretraining_experiments import (
MontySupervisedObjectPretrainingExperiment,
)
from tbp.monty.simulators.habitat.configs import (
PatchViewFinderMountHabitatEnvInterfaceConfig,
)
#####
# To test your env and help you familiarize yourself with the code, we'll run the simplest possible
# experiment. We'll use a model with a single learning module as specified in
# monty_config. We'll also skip evaluation, train for a single epoch for a single step,
# and only train on a single object, as specified in experiment_args and train_env_interface_args.
#####
first_experiment = dict(
experiment_class=MontySupervisedObjectPretrainingExperiment,
logging_config=LoggingConfig(),
experiment_args=SupervisedPretrainingExperimentArgs(
do_eval=False,
max_train_steps=1,
n_train_epochs=1,
),
monty_config=PatchAndViewMontyConfig(),
# Set up the environment and agent.
env_interface_config=PatchViewFinderMountHabitatEnvInterfaceConfig(),
train_env_interface_class=ED.EnvironmentInterfacePerObject,
train_env_interface_args=get_env_interface_per_object_by_idx(start=0, stop=1),
)
experiments = MyExperiments(
first_experiment=first_experiment,
)
CONFIGS = asdict(experiments)

Next you will need to declare your experiment name as part of the MyExperiments dataclass in the benchmarks/configs/names.py file:
@dataclass
class MyExperiments:
    first_experiment: dict

To run this experiment you just defined, you can now simply navigate to the benchmarks/ folder and call the run.py script with the experiment name as the -e argument.
cd benchmarks
python run.py -e first_experiment

Now that you have run your first experiment, let's unpack what happened. This first section involves a lot of text, but rest assured, once you grok this first experiment, the rest of the tutorials will be much more interactive and will focus on running experiments and using tooling. This first experiment is virtually the simplest one possible, but it is designed to familiarize you with all the pieces and parts of the experimental workflow to give you a good foundation for further experimentation.
Experiments are implemented as Python classes with methods like train and evaluate. In essence, run.py loads a config and calls train and evaluate methods if the config says to run them. Notice that first_experiment has do_eval set to False, so run.py will only run the train method.
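To make this control flow concrete, here is a conceptual sketch of what run.py does. It is not the actual implementation; the real script also wires up logging, Monty, and the environment interface, and instantiates experiments differently, so treat all names and signatures here as assumptions.

# Conceptual sketch only -- illustrates the control flow described above.
from benchmarks.configs.my_experiments import CONFIGS

config = CONFIGS["first_experiment"]
experiment = config["experiment_class"]()  # e.g. MontySupervisedObjectPretrainingExperiment
# asdict() turned experiment_args into a plain dict, so we read flags from it.
if config["experiment_args"].get("do_train", True):
    experiment.train()
if config["experiment_args"].get("do_eval", True):
    experiment.evaluate()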
One epoch will run training (or evaluation) on all the specified objects. An epoch generally consists of multiple episodes, one for each object, or for each pose of an object in the environment. An episode is one training or evaluating session with one single object. This episode consists of a sequence of steps. What happens in a step depends on the particular experiment, but an example would be: shifting the agent's position, reading sensor inputs, transforming sensor inputs to features, and adding these features to an object model. For more details on this default experiment setup see this section from the Monty documentation.
If you examine the MontyExperiment class, the parent class of MontySupervisedObjectPretrainingExperiment, you will notice that there are related methods like {pre,post}_epoch and {pre,post}_episode. With inheritance or mixin classes, you can use these methods to customize what happens before, during, and after each epoch or episode. For example, MontySupervisedObjectPretrainingExperiment reimplements pre_episode and post_epoch to provide extra functionality specific to pretraining experiments. Also notice that each method contains calls to a logger. Logger classes can also be customized to log specific information at each control point. Finally, we save a model with the save_state_dict method at the end of each epoch. All told, the sequence of method calls goes something like this:
- MontyExperiment.train (loops over epochs)
  - Do pre-train logging.
  - MontyExperiment.run_epoch (loops over episodes)
    - MontyExperiment.pre_epoch
      - Do pre-epoch logging.
    - MontyExperiment.run_episode (loops over steps)
      - MontyExperiment.pre_episode
        - Do pre-episode logging.
      - Monty.step
      - MontyExperiment.post_episode
        - Update object model in memory.
        - Do post-episode logging.
    - MontyExperiment.post_epoch
      - MontyExperiment.save_state_dict
      - Do post-epoch logging.
  - Do post-train logging.
and this is exactly the procedure that was executed when you ran python run.py -e first_experiment. (Please note that we're writing MontyExperiment in the above sequence rather than MontySupervisedObjectPretrainingExperiment for the sake of generality). When we run Monty in evaluation mode, the same sequence of calls is initiated by MontyExperiment.evaluate minus the model updating step in MontyExperiment.post_episode. See here for more details on epochs, episodes, and steps.
The model is specified in the monty_config field of the first_experiment config as a PatchAndViewMontyConfig which is in turn defined within src/tbp/monty/frameworks/config_utils/config_args.py. Yes, that's a config within a config. The reason for nesting configs is that the model is an ensemble of LearningModules (LMs), and SensorModules (SMs), each of which could potentially have their own configuration as well. For more details on configuring custom learning or sensor modules see this guide.
For now, we will start with one of the simplest and most common versions of this complex system. The PatchAndViewMontyConfig dataclass has fields learning_module_configs and sensor_module_configs where each key is the name of an LM (or SM, respectively), and each value is the full config for that model component. Our first model has one LM and two SMs. Why two SMs and only one LM? One SM provides the LM with processed observations, while the second SM is used solely to initialize the agent at the beginning of the experiment.
Note that the sm_to_agent_dict field of the model config maps each SM to an "agent" (i.e. a moveable part), and only a single agent is specified, meaning that our model has one moveable part with one sensor attached to it. In particular, it has an RGBD camera attached to it.
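To make this mapping concrete, here is a sketch of what such a dictionary could look like, reusing the SM ids ("patch", "view_finder") and agent id ("agent_id_0") that appear elsewhere in these tutorials; the exact contents of the default config may differ.

# Hypothetical sketch of an SM-to-agent mapping: both sensors ride on one agent.
sm_to_agent_dict = dict(
    patch="agent_id_0",
    view_finder="agent_id_0",
)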
By now, we know that an experiment relies on train and evaluate methods, that each of these runs one or more epochs, which consists of one or more episodes, and finally each episode repeatedly calls model.step. Now we will start unpacking each of these levels, starting with the innermost loop over steps.
In PatchAndViewMontyConfig, notice that the model class is specified as MontyForGraphMatching (src/tbp/monty/frameworks/models/graph_matching.py), which is a subclass of MontyBase defined in src/tbp/monty/frameworks/models/monty_base.py, which in turn is a subclass of Monty, an abstract base class defined in src/tbp/monty/frameworks/models/abstract_monty_classes.py. In the abstract base class Monty, you will see that there are two template methods for two types of steps: _exploratory_step and _matching_step. In turn, each of these steps is defined as a sequence of calls to other abstract methods, including _set_step_type_and_check_if_done, which is a point at which the step type can be switched. The conceptual difference between these types of steps is that during exploratory steps, no inference is attempted, which means no voting and no keeping track of which objects or poses are possible matches to the current observation. Each time model.step is called in the experimental procedure listed under the "Episodes and Epochs" heading, either _exploratory_step or _matching_step will be called. In a typical experiment, training consists of running _matching_step until a) an object is recognized, b) all known objects are ruled out, or c) a step counter exceeds a threshold. Regardless of how the matching phase terminates, the system then switches to running exploratory steps so as to gather more observations and build a more complete model of the object.
You can, of course, customize step types and when to switch between step types by defining subclasses or mixins. To set the initial step type, use model.pre_episode. To adjust when and how to switch step types, use _set_step_type_and_check_if_done.
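As a hedged illustration of this kind of customization, the sketch below subclasses MontyForGraphMatching and overrides _set_step_type_and_check_if_done. The attribute names used for the switch criterion (step_type, matching_steps) are assumptions, and the fixed step budget is made up for illustration.

from tbp.monty.frameworks.models.graph_matching import MontyForGraphMatching


class BudgetedMonty(MontyForGraphMatching):
    def _set_step_type_and_check_if_done(self):
        # Assumed attribute names: switch to exploration after a fixed budget
        # of matching steps, otherwise defer to the default switching logic.
        if self.step_type == "matching_step" and self.matching_steps >= 50:
            self.step_type = "exploratory_step"
        else:
            super()._set_step_type_and_check_if_done()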
In this particular experiment, n_train_epochs was set to 1, and max_train_steps was set to 1. This means a single epoch was run, with one matching step per episode. In the next section, we go up a level from the model step to understand episodes and epochs.
In the config for first_experiment, there is a comment that marks the start of environment and agent setup. Now we turn our attention to everything below that line, as this is where episode specifics are defined.
The environment interface class is the way we interact with a simulation environment. The objects within an environment are assumed to be the same for both training and evaluation (for now), hence only one (class, args) pairing is needed. Note however that object orientations, as well as specific observations obtained from an object, will generally differ across training and evaluation.
The environment interface is basically the API between the environment and the model. Its job is to sample from the environment and return observations to the model (+initialize and reset the environment). Note that the next observation is decided by the last action, and the actions are selected by a motor_system. This motor system is shared by reference with the model. By changing the actions, the model controls what it observes next, just as you would expect from a sensorimotor system.
Now, finally answering our question of what happens in an episode, notice that our config uses a special type of environment interface: EnvironmentInterfacePerObject (note that this is a subclass of EnvironmentInterface, which is kept as general as possible to allow for flexible subclass customization). As indicated in the docstring, this environment interface has a list of objects, and at the beginning / end of an episode, it removes the current object from the environment, increments a (cyclical) counter that determines which object is next, and places the new object in the environment. The arguments to EnvironmentInterfacePerObject determine which objects are added to the environment and in what pose. In our config, we use a single list with one YCB object, as shown by this line: train_env_interface_args=get_env_interface_per_object_by_idx(start=0, stop=1).
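For example, widening the index range would cycle through more objects, one per episode. A minimal sketch, assuming the same helper and imports as above:

# Hypothetical variant: train on the first four YCB objects instead of one.
train_env_interface_args = get_env_interface_per_object_by_idx(start=0, stop=4)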
To wrap up this tutorial, we'll cover a few more details of the model. Recall that sm_to_agent_dict assigns each SM to a moveable part (i.e. an "agent"). The action space for each moveable part is in turn defined in the motor_system_config part of the model config. Once an action is executed, the agent moves, and each sensor attached to that agent (here just a single RGBD sensor) receives an observation. Just as sm_to_agent_dict specifies which sensors are attached to which agents, in src/tbp/monty/frameworks/config_utils/config_args the MontyConfig field sm_to_lm_matrix specifies for each LM which SMs it will receive observations from. Thus, observations flow from agents to sensors (SMs), and from SMs to LMs, where all actual modeling takes place in the LM. Near the end of model.step (remember, this can be either matching_step or exploratory_step), the model calls decide_location_for_movement which selects actions and closes the loop between the model and the environment. Finally, at the end of each epoch, we save a model in a directory specified by the ExperimentArgs field of the model config.
That was a lot of text, so let's review what all went into this experiment.
- We ran a MontyExperiment using run.py
- We went through the train procedure with one epoch
- The epoch looped over a list of objects of length 1, so a single episode was run
- The max steps was set to 1, so all told, we took one single step on one single object
- Our model had a single agent with a single RGBD camera attached to it
- During model.step, matching_step was called and one SM received one observation from the environment
- The decide_location_for_movement method was called
- We saved our model at the end of the epoch
Congratulations on completing your first experiment! Ready to take the next step? Learn the ins-and-outs of pretraining a model.
This tutorial demonstrates how to configure and run Monty experiments for unsupervised continual learning. In this regime, Monty learns while it explores an object and attempts to determine its identity and pose. If an object has been recognized as a previously seen item, then any knowledge gained during its exploration is added to an existing model and committed to memory. If Monty does not recognize the object, then a new model is generated for the object and stored. In this way, Monty jointly performs unsupervised learning and object/pose recognition. This mode of operation is distinct from those used in our tutorials on pretraining and model evaluation, in which learning and inference were performed in separate stages (useful for more controlled experiments on one of the two). It is closer to the ultimate vision of Monty, where learning and inference are closely intertwined, much as they are in humans.
Our model will have one surface agent connected to one sensor module, which is in turn connected to one learning module. Our dataset will consist of two objects from the YCB dataset, and each will be shown at a different random rotation in each episode. Monty will see each object three times in total.
📘 Don't have the YCB Dataset Downloaded? You can find instructions for downloading the YCB dataset here.
To follow along, open the benchmarks/configs/my_experiments.py file and paste the code snippets into it.
import os
from dataclasses import asdict
import numpy as np
from benchmarks.configs.names import MyExperiments
from tbp.monty.frameworks.config_utils.config_args import (
CSVLoggingConfig,
MontyArgs,
SurfaceAndViewMontyConfig,
)
from tbp.monty.frameworks.config_utils.make_env_interface_configs import (
EnvironmentInterfacePerObjectArgs,
ExperimentArgs,
RandomRotationObjectInitializer,
)
from tbp.monty.frameworks.environments import embodied_data as ED
from tbp.monty.frameworks.experiments import (
MontyObjectRecognitionExperiment,
)
from tbp.monty.frameworks.models.evidence_matching.learning_module import (
EvidenceGraphLM,
)
from tbp.monty.simulators.habitat.configs import (
SurfaceViewFinderMountHabitatEnvInterfaceConfig,
)
"""
Basic setup
-----------
"""
# Specify directory where an output directory will be created.
project_dir = os.path.expanduser("~/tbp/results/monty/projects")
# Specify a name for the model.
model_name = "surf_agent_2obj_unsupervised"
# Here we specify which objects to learn. We are going to use the mug and bowl
# from the YCB dataset.
object_names = ["mug", "bowl"]

We now set up the learning module configs.
# Set up config for an evidence graph learning module.
learning_module_0 = dict(
learning_module_class=EvidenceGraphLM,
learning_module_args=dict(
max_match_distance=0.01,
# Tolerances within which features must match stored values in order to add
# evidence to a hypothesis.
tolerances={
"patch": {
"hsv": np.array([0.05, 0.1, 0.1]),
"principal_curvatures_log": np.ones(2),
}
},
feature_weights={
"patch": {
# Weighting saturation and value less since these might change
# under different lighting conditions.
"hsv": np.array([1, 0.5, 0.5]),
}
},
x_percent_threshold=20,
# Thresholds to use for when two points are considered different enough to
# both be stored in memory.
graph_delta_thresholds=dict(
patch=dict(
distance=0.01,
pose_vectors=[np.pi / 8, np.pi * 2, np.pi * 2],
principal_curvatures_log=[1.0, 1.0],
hsv=[0.1, 1, 1],
)
),
# object_evidence_th sets a minimum threshold on the amount of evidence we have
# for the current object in order to converge; while we can also set min_steps
# for the experiment, this puts a more stringent requirement that we've had
# many steps that have contributed evidence.
object_evidence_threshold=100,
# Symmetry evidence (indicating possibly symmetry in rotations) increments a lot
# after 100 steps and easily reaches the default required evidence. The below
# parameter value partially addresses this, although we note these are temporary
# fixes and we intend to implement a more principled approach in the future.
required_symmetry_evidence=20,
hypotheses_updater_args=dict(
max_nneighbors=5
)
),
)
learning_module_configs = dict(learning_module_0=learning_module_0)

Now we define the full experiment config which will include our learning module config.
# The config dictionary for the unsupervised learning experiment.
surf_agent_2obj_unsupervised = dict(
# Set up unsupervised experiment.
experiment_class=MontyObjectRecognitionExperiment,
experiment_args=ExperimentArgs(
# Not running eval here. The only difference between training and evaluation
# is that during evaluation, no models are updated.
do_eval=False,
n_train_epochs=3,
max_train_steps=2000,
max_total_steps=5000,
),
logging_config=CSVLoggingConfig(
python_log_level="INFO",
output_dir=project_dir,
run_name=model_name,
),
# Set up monty, including LM, SM, and motor system. We will use the default
# sensor modules (1 habitat surface patch, one logging view finder), motor system,
# and connectivity matrices given by `SurfaceAndViewMontyConfig`.
monty_config=SurfaceAndViewMontyConfig(
# Take 1000 exploratory steps after recognizing the object to collect more
# information about it. Require at least 100 steps before recognizing an object
# to avoid early misclassifications when we have few objects in memory.
monty_args=MontyArgs(num_exploratory_steps=1000, min_train_steps=100),
learning_module_configs=learning_module_configs,
),
# Set up the environment and agent.
env_interface_config=SurfaceViewFinderMountHabitatEnvInterfaceConfig(),
train_env_interface_class=ED.InformedEnvironmentInterface,
train_env_interface_args=EnvironmentInterfacePerObjectArgs(
object_names=object_names,
object_init_sampler=RandomRotationObjectInitializer(),
),
# Doesn't get used, but currently needs to be set anyways.
eval_env_interface_class=ED.InformedEnvironmentInterface,
eval_env_interface_args=EnvironmentInterfacePerObjectArgs(
object_names=object_names,
object_init_sampler=RandomRotationObjectInitializer(),
),
)

If you have read our previous tutorials on pretraining or running inference with a pretrained model, you may spot a few differences in this setup. For pretraining, we used the MontySupervisedObjectPretrainingExperiment class, which also performs training (and not evaluation). While that was a training-only setup, it differs from our unsupervised continual learning config since it supplies object labels to learning modules. For running inference with a pretrained model, we used the MontyObjectRecognitionExperiment class but specified that we only wanted to perform evaluation (i.e., do_train=False and do_eval=True). In contrast, here we use the MontyObjectRecognitionExperiment with arguments do_train=True and do_eval=False. This combination of experiment class and do_train/do_eval arguments is specific to unsupervised continual learning. We have also increased min_train_steps, object_evidence_threshold, and required_symmetry_evidence to avoid early misclassification when there are fewer objects in memory.
Besides these crucial changes, we have also made a few minor adjustments to simplify the rest of the configs. First, we did not explicitly define our sensor module or motor system configs. This is because we are using SurfaceAndViewMontyConfig's default sensor modules, motor system, and matrices that define connectivity between agents, sensors, and learning modules. Second, we are using a RandomRotationObjectInitializer which randomly rotates an object at the beginning of each episode rather than rotating an object by a specific user-defined rotation. Third, we are using the CSVLoggingConfig. This is equivalent to setting up a base LoggingConfig and specifying that we only want a BasicCSVStatsHandler, but it's a bit more succinct. Monty has many config classes provided for this kind of convenience.
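For illustration, the sketch below spells out what an (assumed) equivalent base logging config could look like. The monty_handlers field name and the BasicCSVStatsHandler import path are assumptions, so check the config_args and logger modules before adapting this.

from tbp.monty.frameworks.config_utils.config_args import LoggingConfig
from tbp.monty.frameworks.loggers.monty_handlers import BasicCSVStatsHandler  # assumed path

# Assumed equivalent of CSVLoggingConfig(python_log_level="INFO", ...):
logging_config = LoggingConfig(
    python_log_level="INFO",
    output_dir=project_dir,
    run_name=model_name,
    monty_handlers=[BasicCSVStatsHandler],  # only log CSV stats
    wandb_handlers=[],  # no WandB logging
)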
Finally, add your experiment to MyExperiments at the bottom of the file:
experiments = MyExperiments(
surf_agent_2obj_unsupervised=surf_agent_2obj_unsupervised,
)
CONFIGS = asdict(experiments)

Next you will need to declare your experiment name as part of the MyExperiments dataclass in the benchmarks/configs/names.py file:
@dataclass
class MyExperiments:
    surf_agent_2obj_unsupervised: dict

To run this experiment, navigate to the benchmarks/ folder in a terminal and call the run.py script with the experiment name as the -e argument like so:
cd benchmarks
python run.py -e surf_agent_2obj_unsupervised

Once complete, you can inspect the simulation results and visualize the learned models. The logs are located at ~/tbp/results/monty/projects/surf_agent_2obj_unsupervised. Open train_stats.csv, and you should see a table with six rows, one for each episode. The first 7 columns should look something like this:
During epoch 0, Monty saw the mug and bowl objects for the first time. Since the observations collected during these episodes do not match any objects stored in memory, Monty reports that no_match has been found in the primary_performance column. When no match has been found, Monty creates a new object ID for the unrecognized object and stores the learned model in memory. In the result column, we see that the new IDs for the mug and bowl are new_object0 and new_object1, respectively (matching the obj_name used in the plotting script below). The objects don't have meaningful names, since no labels are provided.
In all subsequent episodes, Monty correctly identified the objects as indicated by correct in the primary_performance column. When Monty finds a match for an object in its memory, it does not simply terminate the episode. Instead, it continues to explore the object for num_exploratory_steps steps and then updates the object model with the new information collected during the episode.
Note that Monty receives minimal information about when a new epoch has started; the only indication is that evidence scores are reset to 0 (an assumption we intend to relax in the future). If a new object were introduced in the second or third epoch, it should again detect no_match and learn a new model for this object. Also note that, for logging purposes, we save which object was sensed during each episode and which model was updated or associated with this object; Monty itself has no access to this information. It can happen that multiple objects are merged into one model or that multiple models are learned for one object. This is tracked in mean_objects_per_graph and mean_graphs_per_object in the .csv statistics, as well as in possible_match_sources for each model, which is used to determine whether the performance was correct or 'confused'.
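To check these statistics programmatically, a short pandas sketch like the one below can help; the column names are taken from the description above, so verify them against your own train_stats.csv.

import os
import pandas as pd

exp_dir = os.path.expanduser("~/tbp/results/monty/projects/surf_agent_2obj_unsupervised")
stats = pd.read_csv(os.path.join(exp_dir, "train_stats.csv"))
# Which performance label and model ID each episode ended with.
print(stats[["primary_performance", "result"]])
# How cleanly objects map onto learned models (1.0 is a perfect 1:1 mapping).
print(stats[["mean_objects_per_graph", "mean_graphs_per_object"]].tail(1))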
We can visualize how models are acquired and refined by plotting an object's model after different epochs. To do so, create a script and paste in the following code. The name and location of the script are arbitrary, but we called it unsupervised_learning_analysis.py and placed it at ~/monty_scripts.
import os
import matplotlib.pyplot as plt
import torch
from tbp.monty.frameworks.models.object_model import GraphObjectModel
from tbp.monty.frameworks.utils.plot_utils_dev import plot_graph
def load_graph(exp_dir: str, epoch: int, object_name: str) -> GraphObjectModel:
    # Load the checkpoint saved at the end of the given epoch and return the
    # graph learned for this object by the first learning module.
    model_path = os.path.join(exp_dir, str(epoch), "model.pt")
    state_dict = torch.load(model_path)
    return state_dict["lm_dict"][0]["graph_memory"][object_name]["patch"]
exp_dir = os.path.expanduser(
"~/tbp/results/monty/projects/surf_agent_2obj_unsupervised"
)
n_epochs = 3
obj_name = "new_object1" # The generated object ID corresponding to the bowl
# Load object graphs
graphs = [load_graph(exp_dir, epoch, obj_name) for epoch in range(n_epochs)]
fig = plt.figure(figsize=(8, 3))
for epoch in range(n_epochs):
ax = fig.add_subplot(1, n_epochs, epoch + 1, projection="3d")
plot_graph(graphs[epoch], ax=ax)
ax.view_init(137, 40, 0)
ax.set_title(f"epoch {epoch}")
fig.suptitle("Bowl Object Models")
fig.tight_layout()
plt.show()

After running this script, you should see a plot with three views of the bowl object model, one for each epoch, like so:
After Monty's first pass (epoch 0), Monty's internal model of the bowl is somewhat incomplete. As Monty continues to encounter the bowl in subsequent epochs, the model becomes further refined by new observations. In this way, Monty learns about new and existing objects continuously without the need for supervision or distinct training periods. To update existing models, the detected pose is used to transform new observations into the existing model's reference frame. For more details on how learning modules operate in this mode, see our documentation on how learning modules work.
Thus far we have demonstrated how to build models that use a single sensor module connected to one learning module. In the next tutorial, we will demonstrate how to configure multiple sensors and connect them to multiple learning modules to perform faster inference.
❗️ This is an Advanced Tutorial If you've arrived at this page and you're relatively new to Monty, then we would recommend you start by reading some of our other documentation first. Once you're comfortable with the core concepts of Monty, then we think you'll enjoy learning about how to apply it to robotics in the following tutorial!
As Monty is a sensorimotor learning system, robotics is a large application area that naturally comes to mind. This tutorial explains in more detail how Monty can be used for robotics applications outside a simulator. It builds on the previous tutorial, so if you haven't read it yet, we highly recommend starting there.
Currently, Monty relies on a couple of dependencies that can NOT be installed on standard robotics hardware, such as the Raspberry Pi. We are working on removing those from the default dependencies, but for now, we recommend not running Monty on the robot's hardware directly. Instead, one can stream the sensor outputs and action commands back and forth, between lightweight code running on the physical system, and Monty running on a laptop (or other cloud computing infrastructure). This has the advantage of simplifying debugging and visualizing what happens in Monty. It also makes it easier to run more complex instances of Monty (many learning modules) without running into the limitations of on-device computational power. For some applications, the additional delay of streaming the data may cause issues, and future work will investigate how big of a problem this is and how we can allow Monty to run on the device in those cases.
When choosing hardware to use with Monty, we recommend thinking through certain requirements in advance to avoid pain points later on. In particular, Monty's performance depends on its ability to accurately map its observations into a fixed reference frame. For this to be successful, we recommend designing your system so that it can accurately measure the following:
- An object's position relative to the sensor. Example: A well-calibrated RGBD camera with precise, consistent depth measurements.
- A sensor's position and orientation relative to the robot's body. Example: A swiveling camera fixed to a platform by a vertical rod, where the rod's dimensions and mounting points are known, and the swiveling mechanism provides up-to-date measurements of its current angle.
- The robot's position and orientation. Example: A handheld device whose position and orientation are measured by external (6DoF) trackers. In principle, one can use relative changes to update a pose estimate in place of external tracking. In practice, the updating method introduces drift, which can quickly become a problem. It may be easier to use external trackers, depending on the setting. Otherwise, plan on having a way to minimize or compensate for drift.
Monty Meets World is the code name for our first demo of Monty on real-world data. For a video of this demo, see our project showcase page. In a previous tutorial we showed how we can recognize objects and their pose from a dataset collected with the iPad camera. Now we will turn this into a live demo where the iPad directly streams its camera image to Monty.
In this application, we wrote a MontyMeetsWorld iOS app that runs locally on the iPad (or iPhone). The app has a button that the user can press to take an image with the user-facing TrueDepth camera. When a picture is taken, it is streamed to a server running locally on a laptop, where it gets saved in a dedicated folder. At the same time, Monty is running on the laptop.

Monty is configured to use the SaccadeOnImageFromStreamEnvironmentInterface. The environment interface's pre_epoch function calls switch_to_scene on the SaccadeOnImageFromStreamEnvironment, which does nothing until a new image is found in the dedicated folder. Once it detects that a new image was saved there, it loads this image, and the episode starts. The environment interface then moves a small patch over the image (the same way as in the non-streamed version explained in the previous tutorial) and sends the observations from the moving patch to Monty until Monty recognizes the object. After that, it ends the current episode and returns to waiting for the next image, which will start the next episode.
Note that in this example, we are not controlling any external actuators. All of Monty's movements happen virtually by moving a small patch over the larger image. In theory, there is nothing preventing Monty from streaming an action command back to the robot. However, in this case, there isn't an automated way to move an iPad in space. There could be an option to move the iPad manually and send this movement information to Monty along with the sensed image. However, this would require movement tracking of the iPad which was out of the scope of the five-day hackathon when we implemented this.

📘 Follow Along If you would like to test the MontyMeetsWorld app, you can find code and run instructions here. To run the demo there are three main steps involved:
- Open the MontyMeetsWorld project in Xcode and run the iOS app on your iPad or iPhone (instructions in this README)
- Start a server on your laptop to listen for images streamed from the app by running python src/tbp/monty/frameworks/environment_utils/server.py
- Start a Monty experiment that will wait for an image to be received and then run an episode to recognize the object. While the server script and app are running, use this command in a separate terminal window: python benchmarks/run.py -e world_image_from_stream_on_scanned_model

Make sure to set your WiFi's IP address in the server.py script and the app settings on your device. Then, once the app, the server, and the Monty experiment are running, you can show an object to the camera and press the Save Image button in the app.
During the 2025 Robot Hackathon a working version of the Ultrasound application was prototyped. You can see the code here: https://github.com/thousandbrainsproject/ultrasound_perception
For the ultrasound demo project, we went through the same thought process as outlined for any Monty application in the previous tutorial. We needed to define observations, movement, and how movement affects the state of the sensor and its observations. The sensor is a handheld ultrasound device. In this case, Monty is not actively moving the sensor. Instead, a human operator moves the ultrasound device while Monty can (optionally) suggest positions that the operator should move the sensor to. Although Monty is not actively moving the sensor, it still needs to know how the sensor is moving. For this, we decided to attach a Vive Tracker to the ultrasound device, which uses two wall-mounted base stations to track the pose (6DOF location and orientation) of the sensor in the room.
Both the location information and the ultrasound recordings are streamed to a laptop that runs Monty. To stream the ultrasound data, we can write a small iOS app, similar to the MontyMeetsWorld app, using the ultrasound device SDK. Once the ultrasound image arrives on the laptop, Monty can move a small patch over it. It can also use the tracked sensor location in the room to integrate the physical movement of the probe.
A custom sensor module can then extract features and poses from the ultrasound image. The pose could be extracted from the surface normal detected at borders in the patch. Features could summarize information about measured density and texture.
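As a purely illustrative sketch of this idea (none of these names come from the Monty codebase), such a sensor module could compute features along these lines:

import numpy as np

def extract_patch_features(patch: np.ndarray) -> dict:
    # Mean intensity as a crude proxy for measured density.
    density = float(patch.mean())
    # Local intensity variation as a crude texture feature.
    texture = float(patch.std())
    # Average image gradient, from which a border orientation (and hence a
    # pose estimate, as described above) could be derived.
    gy, gx = np.gradient(patch.astype(float))
    border_direction = np.array([gx.mean(), gy.mean(), 0.0])
    return dict(density=density, texture=texture, border_direction=border_direction)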
For the first test, we recognize 3D objects inside a phantom (a clear bag filled with fluid). For simplicity, the objects can be learned beforehand in simulation, similar to the Monty Meets World application. However, since we have 6DOF pose tracking of the sensor, we can also learn the objects directly from the ultrasound data in the real world.
Monty can use three types of action output to recognize the sensed objects efficiently.
- It can move the patch over the full ultrasound image, akin to moving it over the full camera image from the iPad.
- It can suggest a location in the room for the human operator to move the ultrasound probe to in order to get a different view.
- It can adjust the settings of the probe, such as depth of field and gain. These are not required for object recognition to work, but they can help make recognition more efficient and robust.
📘 Follow Along
If you’re curious to see how this was set up, you can check out the Ultrasound Perception repository. See the videos and more pictures on the showcase page.
During the 2025 Robot Hackathon a working version of the LEGO robot application was prototyped. You can see the code here: https://github.com/thousandbrainsproject/everything_is_awesome
This robot used Monty to explore and learn about real-world objects. This project was our first full integration of Monty with a physical robot that could sense the environment on its own and move in 3D space.
The robot was built from LEGO Technic parts, Raspberry Pi boards, and off-the-shelf sensors. We used two Pis, one to control the motors and read from a depth sensor, and another to read the RGB image sensor. Monty itself ran on a nearby laptop, which communicated with the robot over a local Wi-Fi network. This setup allowed us to keep the computation off the robot while still enabling real-time streaming of sensor data and motor commands.
Rather than move the entire robot around an object, we placed the object on a rotating platform. As the object turned, the robot’s camera experienced the same kind of movement as if the robot were orbiting the object. This trick made things simpler mechanically while still allowing Monty to build a 3D model of the object using its depth and RGB observations.
The core idea behind the project was to create a real robot that could explore the target object, learn what it looks like, and later recognize it, even if it was moved or rotated. It was exciting to see Monty, originally tested in simulated environments, start to perceive and interact with the physical world in real time.
📘 Follow Along
If you’re curious to see how this was set up, you can check out the Everything Is Awesome repository. We include the parts list, Raspberry Pi setup guides, custom everything_is_awesome classes, and some project visualizations. See the videos and more pictures on the showcase page.
The current solution for running Monty on robots is to stream the sensor data and action commands back and forth between the robot and a Monty instance running on a laptop. Outside of that, defining a custom environment interface for Monty is analogous to how it was outlined in the previous tutorial.
⚠️ This is an Advanced Tutorial If you've arrived at this page and you're relatively new to Monty, then we would recommend you start by reading some of our other documentation first. Once you're comfortable with the core concepts of Monty, then we think you'll enjoy learning about how to apply it to custom applications in the following tutorial!
Monty aims to implement a general-purpose algorithm for understanding and interacting with the world. It was designed to be very modular so that the same Monty configuration can be tested in many different environments and various Monty configurations can be compared in the same environment. Up to now, the tutorials have demonstrated Monty in a simulated environment (HabitatSim) where a sensor explores 3D objects and recognizes their ID and pose. Here, we will show you how to use Monty in other environments.
Monty is a sensorimotor modeling system. It is NOT made for learning from static datasets (although some can be framed to introduce movement, such as the Omniglot example below). Any application where you want to use Monty should have some concept of movement and how movement will change the state of the agent and what is being observed.
⚠️ Monty Currently Expects Movement to be in 3D Euclidean Space In the current implementation, movement should happen in 3D (or less) space and be tracked using Euclidean location coordinates. Although we are convinced that the basic principles of Monty will also apply to abstract spaces (potentially embedded in 3D space) and we know that the brain uses different mechanisms to encode space, the current implementation relies on 3D Euclidean space.
The diagram below shows the base abstract classes in Monty. For general information on how to customize those classes, see our guide on Customizing Monty.
The Experiment class coordinates the experiment (learning and evaluation). It initializes and controls Monty and the environment and coordinates the interaction between them.
The EmbodiedEnvironment class is wrapped in an EnvironmentInterface subclass, which exposes methods to interact with the environment. An experiment can have two environment interfaces associated with it: one for training and one for evaluation.
Information flow in Monty implements a sensorimotor loop. Observations from the environment are first processed by the sensor module. The resulting CMP-compliant output is then used by the learning modules to model and recognize what it is sensing. The learning modules can suggest an action (GoalState) to the motor system at each step. The motor system decides which action to execute and translates it into motor commands. The environment interface then uses this action to extract the next observation from the environment. The next observation is sent to the sensor module(s) and the loop repeats.
Additionally, the EnvironmentInterface and Environment can implement specific functions to be executed at different points in the experiment, such as resetting the agent position and showing a new object or scene at the beginning of a new episode.
To use Monty in a custom environment, you usually need to customize the EnvironmentInterface and EmbodiedEnvironment classes. For example, if you look back at the previous tutorials, you will see that for those Habitat experiments, we've been using the EnvironmentInterfacePerObject and the HabitatEnvironment. The diagram below shows some key elements that need to be defined for these two classes. It's best to start thinking about the environment setup first, as this will force you to think through how to structure your application correctly for Monty to tackle.

The first thing to figure out is how movement should be defined in your environment. What actions are possible, and how do these actions change the agent's state and observations?
If you are working with an existing environment, such as one used for reinforcement learning (for example, gym environments or the Habitat environment we are using), you might just need to wrap this into the .step() function of your custom EmbodiedEnvironment class such that when env.step(actions) is called, an observation is returned. If you work with an application that isn't already set up like that, defining how actions lead to the next observation may be more involved. You can look at the OmniglotEnvironmentInterface or SaccadeOnImageEnvironmentInterface as examples (more details below).
The observations should be returned as a nested dictionary with one entry per agent in the environment. Each agent should have sub-dictionaries with observations for each of its sensors. For example, if there is one agent with two sensors that each sense two types of features, it would look like this:
obs = {
AgentID("agent_id_0"): {
"patch_0": {
"depth": depth_sensed_by_patch_0,
"rgba": rgba_sensed_by_patch_0
},
"patch_1": {
"depth": depth_sensed_by_patch_1,
"semantic": semantic_sensed_by_patch_1,
},
}
}

Related to defining how actions change observations, you will also need to define how actions change the state of the agent. This is what the get_state() function returns. The returned state needs to be a dictionary with an entry per agent in the environment. The entry should contain the agent's position and orientation relative to some global reference point. For each sensor associated with that agent, a sub-dictionary should return the sensor's position and orientation relative to the agent.
For example, if you have one agent with two sensors, the state dictionary could look like this:
state = {
AgentID("agent_id_0"): {
"position": current_agent_location,
"rotation": current_agent_orientation,
"sensors": {
"patch_0.depth": {
"rotation": current_depth_sensor_orientation,
"position": current_depth_sensor_location,
},
"patch_0.rgba": {
"rotation": current_rgba_sensor_orientation,
"position": current_rgba_sensor_orientation,
},
},
}
}

Lastly, you need to define what happens when the environment is initialized (__init__()), when it is reset (reset(), usually at the end of an episode), and when it is closed (close(), at the end of an experiment). Resetting could include loading a new scene, resetting the agent position, or changing the arrangement of objects in the environment. It might also reset some of the environment's internal variables, such as step counters. Note that, as is customary for RL environments, the reset() function is also expected to return an observation.
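Putting these pieces together, a skeleton of a custom environment could look like the sketch below. It only assumes the methods discussed in this section; the base-class import path, the constructor signature, and the _get_observations helper are assumptions, so treat this as a starting point rather than a working implementation.

from tbp.monty.frameworks.environments.embodied_environment import (  # assumed path
    EmbodiedEnvironment,
)


class MyEnvironment(EmbodiedEnvironment):
    def __init__(self, data_path):
        self._data_path = data_path
        self._step_num = 0

    def step(self, actions):
        # Apply the actions, then return the nested observation dict
        # described above (agent -> sensor -> features).
        self._step_num += 1
        return self._get_observations()

    def get_state(self):
        # Return agent and sensor poses relative to a global reference point,
        # in the state dict format shown above.
        ...

    def reset(self):
        # E.g. load a new scene / object and reset internal counters. Like RL
        # environments, reset() returns the first observation.
        self._step_num = 0
        return self._get_observations()

    def close(self):
        # Release any resources at the end of the experiment.
        pass

    def _get_observations(self):
        # Hypothetical helper that builds the nested observation dict.
        ...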
The EnvironmentInterface manages retrieving observations from the EmbodiedEnvironment given actions. The EmbodiedEnvironment, in turn, applies basic transforms to the raw observations from the environment.
The EnvironmentInterface should define all the key events at which the environment needs to be accessed or modified. This includes initializing the environment (__init__()), retrieving the next observation (__next__()), and things that happen at the beginning or end of episodes and epochs (pre_episode(), post_episode(), pre_epoch(), post_epoch()). Note that not all of those are relevant to every application.
Think about how your experiment should be structured. What defines an episode? What happens with the environment at the beginning or end of each episode? What happens at the beginning or end of epochs? Does anything need to happen at every step besides retrieving the observation and environment state?
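Answering those questions usually maps directly onto the hooks named above. Below is a minimal sketch of a custom interface, assuming EnvironmentInterface is importable from embodied_data like the other ED.* classes used in these tutorials; the hook signatures are assumptions.

from tbp.monty.frameworks.environments import embodied_data as ED


class MyEnvironmentInterface(ED.EnvironmentInterface):
    def pre_episode(self):
        # E.g. reset the agent pose and place the next object in the scene.
        super().pre_episode()

    def post_episode(self):
        # E.g. advance a counter that determines which object is shown next.
        super().post_episode()

    def __next__(self):
        # Retrieve the next observation, given the action selected by the
        # motor system at the previous step.
        return super().__next__()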
As one of Monty's strengths is the ability to learn from small amounts of data, one interesting application is the Omniglot dataset. It contains drawings of 1623 characters from 50 alphabets. Each of the characters is drawn 20 times by different people, as shown below.

Since this is a static dataset and Monty is a sensorimotor learning system, we first have to define what movement looks like on this dataset. A sensor module in Monty always receives a small patch as input, and the learning module then integrates the extracted features and locations over time to learn and infer complete objects. So, in this case, we can take a small patch on the character (as shown on the right in the figure below) and move this patch further along the strokes at each step. Following the strokes is easy in this case, as the Omniglot dataset also contains the temporal sequence of x, y, and z coordinates in which the characters were drawn (the example image above is colored by the order in which the strokes were drawn, but we also have access to the temporal sequence within each stroke). If this information were unavailable, the patch could be moved arbitrarily or using heuristics such as following the sensed principal curvature directions.

At each step, the sensor module will extract a location and pose in a common reference frame and send it to the learning module. To define the pose at each location, we extract a surface normal and two principal curvature directions from a Gaussian-smoothed image of the patch. As you can see in the images below, the surface normal will always point straight out of the image (as this is a 2D image, not a 3D object surface), while the first principal curvature direction aligns with the stroke direction and the second one is orthogonal to it. The learning module then stores those relative locations and orientations in the model of the respective character and can use them to recognize a character during inference.

Learning and inference on Omniglot characters can be implemented by writing two custom classes, the OmniglotEnvironment and the OmniglotEnvironmentInterface:
- OmniglotEnvironment:
  - Defines initialization of all basic variables in the __init__(patch_size, data_path) function.
  - In this example, we define the action space as None because we give Monty no choice in how to move. The step function just returns the next observation by following the predefined stroke order in the dataset. Note this will still be formulated as a sensorimotor task, as the retrieval of the next observation corresponds to a (pre-defined) movement and we get a relative displacement of the sensor.
  - Defines the step(actions) function, which uses the current step_num in the episode to determine where we are in the stroke sequence and extracts a patch around that location. It then returns a Gaussian-smoothed version of this patch as the observation.
  - Defines get_state(), which returns the current x, y, z location on the character as a state dict (z is always zero since we are in 2D space here).
  - Defines reset() to reset the step_num counter and return the first observation on a new character.
  - Defines helper functions such as:
    - switch_to_object and load_new_character_data to load a new character
    - get_image_patch(img, loc, patch_size) to extract the patch around a given pixel location
    - motor_to_locations to convert the movement information from the Omniglot dataset into locations (pixel indices) on the character image
- OmniglotEnvironmentInterface:
  - Defines initialization of basic variables such as episode and epoch counters in the __init__ function
  - Defines the post_episode function, which calls cycle_object to call the environment's switch_to_object function. Using the episode and epoch counters, it keeps track of which character needs to be shown next.
An experiment config for training on the Omniglot dataset can then look like this:
omniglot_training = dict(
experiment_class=MontySupervisedObjectPretrainingExperiment,
experiment_args=SupervisedPretrainingExperimentArgs(
n_train_epochs=1,
),
logging_config=PretrainLoggingConfig(
output_dir=pretrain_dir,
),
monty_config=PatchAndViewMontyConfig(
# Take 1 step at a time, following the drawing path of the letter
motor_system_config=MotorSystemConfigInformedNoTransStepS1(),
sensor_module_configs=omniglot_sensor_module_config,
),
env_interface_config=OmniglotEnvironmentInterfaceConfig(),
train_env_interface_class=ED.OmniglotEnvironmentInterface,
# Train on the first version of each character (there are 20 drawings for each
# character in each alphabet, here we see one of them). The default
# OmniglotEnvironmentInterfaceArgs specify alphabets = [0, 0, 0, 1, 1, 1] and
# characters = [1, 2, 3, 1, 2, 3]) so in the first episode we will see version 1
# of character 1 in alphabet 0, in the next episode version 1 of character 2 in
# alphabet 0, and so on.
train_env_interface_args=OmniglotEnvironmentInterfaceArgs(versions=[1, 1, 1, 1, 1, 1]),
)

And a config for inference on those trained models could look like this:
omniglot_inference = dict(
experiment_class=MontyObjectRecognitionExperiment,
experiment_args=ExperimentArgs(
model_name_or_path=pretrain_dir + "/omniglot_training/pretrained/",
do_train=False,
n_eval_epochs=1,
),
logging_config=LoggingConfig(),
monty_config=PatchAndViewMontyConfig(
monty_class=MontyForEvidenceGraphMatching,
learning_module_configs=dict(
learning_module_0=dict(
learning_module_class=EvidenceGraphLM,
learning_module_args=dict(
# xyz values are in a larger range so we need to increase max_match_distance
max_match_distance=5,
tolerances={
"patch": {
"principal_curvatures_log": np.ones(2),
"pose_vectors": np.ones(3) * 45,
}
},
# Surface normal always points up, so they are not useful
feature_weights={
"patch": {
"pose_vectors": [0, 1, 0],
}
},
hypotheses_updater_args=dict(
# We assume the letter is presented upright
initial_possible_poses=[[0, 0, 0]],
)
),
)
),
sensor_module_configs=omniglot_sensor_module_config,
),
env_interface_config=OmniglotEnvironmentInterfaceConfig(),
eval_env_interface_class=ED.OmniglotEnvironmentInterface,
# Using version 1 means testing on the same version of the character as trained.
# Version 2 is a new drawing of the previously seen characters. In this small test
# setting these are 3 characters from 2 alphabets.
eval_env_interface_args=OmniglotEnvironmentInterfaceArgs(versions=[1, 1, 1, 1, 1, 1]),
# eval_env_interface_args=OmniglotEnvironmentInterfaceArgs(versions=[2, 2, 2, 2, 2, 2]),
)

📘 Follow Along
To run the above experiment, you first need to download the Omniglot dataset. You can do this by running cd ~/tbp/data and git clone https://github.com/brendenlake/omniglot.git. You will need to unzip the omniglot/python/images_background.zip and omniglot/python/strokes_background.zip files.
To test this, go ahead and copy the configs above into the benchmarks/configs/my_experiments.py file. To complete the configs, you will need to add the following imports, sensor module config, and model paths at the top of the file.
import os
from dataclasses import asdict

import numpy as np

from benchmarks.configs.names import MyExperiments
from tbp.monty.frameworks.config_utils.config_args import (
    LoggingConfig,
    MotorSystemConfigInformedNoTransStepS1,
    PatchAndViewMontyConfig,
    PretrainLoggingConfig,
)
from tbp.monty.frameworks.config_utils.make_env_interface_configs import (
    ExperimentArgs,
    OmniglotEnvironmentInterfaceArgs,
    OmniglotEnvironmentInterfaceConfig,
    SupervisedPretrainingExperimentArgs,
)
from tbp.monty.frameworks.environments import embodied_data as ED
from tbp.monty.frameworks.experiments import (
    MontyObjectRecognitionExperiment,
    MontySupervisedObjectPretrainingExperiment,
)
from tbp.monty.frameworks.models.evidence_matching.learning_module import (
    EvidenceGraphLM,
)
from tbp.monty.frameworks.models.evidence_matching.model import (
    MontyForEvidenceGraphMatching,
)
from tbp.monty.frameworks.models.sensor_modules import (
    HabitatSM,
    Probe,
)
monty_models_dir = os.getenv("MONTY_MODELS")
pretrain_dir = os.path.expanduser(os.path.join(monty_models_dir, "omniglot"))
omniglot_sensor_module_config = dict(
    sensor_module_0=dict(
        sensor_module_class=HabitatSM,
        sensor_module_args=dict(
            sensor_module_id="patch",
            features=[
                "pose_vectors",
                "pose_fully_defined",
                "on_object",
                "principal_curvatures_log",
            ],
            save_raw_obs=False,
            # Need to set this lower since curvature is generally lower
            pc1_is_pc2_threshold=1,
        ),
    ),
    sensor_module_1=dict(
        sensor_module_class=Probe,
        sensor_module_args=dict(
            sensor_module_id="view_finder",
            save_raw_obs=False,
        ),
    ),
)

Finally, you will need to set the experiments variable at the bottom of the file to this:
experiments = MyExperiments(
    omniglot_training=omniglot_training,
    omniglot_inference=omniglot_inference,
)
CONFIGS = asdict(experiments)

And add the two experiments into the MyExperiments class in benchmarks/configs/names.py:
@dataclass
class MyExperiments:
    omniglot_training: dict
    omniglot_inference: dict

Now you can run training by calling python benchmarks/run.py -e omniglot_training and then inference on these models by calling python benchmarks/run.py -e omniglot_inference. You can check the eval_stats.csv file in ~/tbp/results/monty/projects/monty_runs/omniglot_inference/ to see how Monty did. If you copied the code above, it should have recognized all six characters correctly.
❗️ Generalization Performance on Omniglot is Bad Without Hierarchy
Note that we currently don't get good generalization performance on the Omniglot dataset. If you use the commented-out eval_env_interface_args (eval_env_interface_args=OmniglotEnvironmentInterfaceArgs(versions=[2, 2, 2, 2, 2, 2])) in the inference config, which shows previously unseen versions of the characters, you will see that performance degrades considerably. This is because Omniglot characters are fundamentally compositional objects (strokes arranged relative to each other), and compositional objects can only be modeled by stacking learning modules hierarchically. The above configs do not do this. Our research team is hard at work getting Monty to model compositional objects.
Monty Meets World is the code name for our first demo of Monty on real-world data. For a video of this momentous moment (or is that Montymentous?), see our project showcase page.
In this application, we test Monty's object recognition skills on 2.5D images, that is, photographs that include depth information (RGBD). In this case, the pictures are taken with the iPad's TrueDepth camera (the user-facing camera used for face recognition).
In this use case, we assume that Monty has already learned 3D models of the objects, and we just test its inference capabilities. For training, we scanned a set of real-world objects using photogrammetry, providing us with 3D models of the objects. You can find instructions to download this numenta_lab dataset here. We then render those 3D models in Habitat and learn them by moving a sensor patch over them, just as we do with the YCB dataset. We train Monty in the 3D simulator because in the 2D image setup, Monty has no way of moving around the object and, therefore, would have a hard time learning complete 3D models.
To run this pre-training yourself, you can use the only_surf_agent_training_numenta_lab_obj config. Alternatively, you can download the pre-trained models using the benchmark experiment instructions.
For inference, we use the RGBD images taken with the iPad camera. Movement is defined as a small patch on the image moving up, down, left, and right. At the beginning of an episode, the depth image is converted into a 3D point cloud with one point per pixel. The sensor's location at every step is then determined by looking up the current center pixel location in that 3D point cloud. Each episode presents Monty with one image, and Monty takes as many steps as needed to make a confident classification of the object and its pose.
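To make this lookup concrete, here is a hedged sketch of unprojecting a depth image into a per-pixel 3D point cloud. It assumes a simple pinhole camera with an assumed focal length f; the actual pipeline may use calibrated intrinsics from the iPad camera instead.

import numpy as np

def depth_to_point_cloud(depth, f=500.0):
    """Unproject an HxW depth image into one 3D point per pixel.

    A hedged sketch: assumes a pinhole camera with focal length f (in
    pixels) and the principal point at the image center, rather than
    the calibrated camera model used in the real pipeline.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]  # pixel row/column indices
    x = (u - w / 2) * depth / f
    y = (v - h / 2) * depth / f
    return np.stack([x, y, depth], axis=-1)  # shape (H, W, 3)

# The sensor's 3D location at each step is then a simple lookup:
# location = point_cloud[center_row, center_col]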
This can be implemented using two custom classes, the SaccadeOnImageEnvironment and the SaccadeOnImageEnvironmentInterface:

SaccadeOnImageEnvironment:
- Defines initialization of all basic variables in the __init__(patch_size, data_path) function.
- Defines the TwoDDataActionSpace to move up, down, left, and right on the image by a given number of pixels.
- Defines the step(actions) function, which uses the sensor's current location, the given actions, and their amounts to determine the new location on the image and extract a patch. It updates self.current_loc and returns the sensor patch observations as a dictionary.
- Defines get_state(), which returns the current state as a dictionary. The dictionary mostly contains self.current_loc and placeholders for the orientation, as the sensor and agent orientation never change.
- Defines helper functions such as switch_to_object(scene_id, scene_version_id) to load a new image, get_3d_scene_point_cloud to extract a 3D point cloud from the depth image, get_next_loc(action_name, amount) to determine valid next locations in pixel space, get_3d_coordinates_from_pixel_indices(pixel_ids) to get the 3D location from a pixel index, and get_image_patch(loc) to extract a patch at a location in the image. These functions are all used internally within the __init__, step, and get_state functions (except for the switch_to_object function, which is called by the SaccadeOnImageEnvironmentInterface).

SaccadeOnImageEnvironmentInterface:
- Defines initialization of basic variables such as episode and epoch counters in the __init__ function.
- Defines the post_episode function, which calls cycle_object to call the environment's switch_to_object function. Using the episode and epoch counters, it keeps track of which image needs to be shown next.
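To make this structure concrete, here is a minimal, hypothetical skeleton of the environment class. The method names follow the description above, but the bodies and the action interface (objects with .name and .amount attributes) are illustrative assumptions, not the actual implementation.

import numpy as np


class SaccadeOnImageEnvironmentSketch:
    """Hedged skeleton of the environment described above."""

    def __init__(self, patch_size=64, data_path="~/tbp/data/worldimages"):
        self.patch_size = patch_size
        self.data_path = data_path
        self.current_loc = np.array([0, 0])  # pixel location of the patch center

    def step(self, actions):
        # Move the patch for each action, then extract observations there.
        for action in actions:
            self.current_loc = self.get_next_loc(action.name, action.amount)
        return {"patch": self.get_image_patch(self.current_loc)}

    def get_state(self):
        # The sensor never rotates in this setup, so orientation is a placeholder.
        return {"location": self.current_loc, "rotation": np.eye(3)}

    def get_next_loc(self, action_name, amount):
        # TwoDDataActionSpace: up/down/left/right by a number of pixels.
        deltas = {
            "up": (-amount, 0),
            "down": (amount, 0),
            "left": (0, -amount),
            "right": (0, amount),
        }
        return self.current_loc + np.array(deltas[action_name])

    def get_image_patch(self, loc):
        ...  # crop a patch_size x patch_size window of RGBD values around loc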
An experiment config can then look like this:
monty_meets_world_2dimage_inference = dict(
    experiment_class=MontyObjectRecognitionExperiment,
    experiment_args=EvalExperimentArgs(
        model_name_or_path=model_path_numenta_lab_obj,
        n_eval_epochs=1,
    ),
    logging_config=ParallelEvidenceLMLoggingConfig(wandb_group="benchmark_experiments"),
    monty_config=PatchAndViewMontyConfig(
        learning_module_configs=default_evidence_1lm_config,
        monty_args=MontyArgs(min_eval_steps=min_eval_steps),
        # move 20 pixels at a time
        motor_system_config=MotorSystemConfigInformedNoTransStepS20(),
    ),
    env_interface_config=WorldImageEnvironmentInterfaceConfig(
        env_init_args=EnvInitArgsMontyWorldStandardScenes()
    ),
    eval_env_interface_class=ED.SaccadeOnImageEnvironmentInterface,
    eval_env_interface_args=WorldImageEnvironmentInterfaceArgs(
        scenes=list(np.repeat(range(12), 4)),
        versions=list(np.tile(range(4), 12)),
    ),
)
For more configs to test on different subsets of the Monty Meets World dataset (such as bright or dark images, hand intrusion, and multiple objects), you can find all RGBD image benchmark configs here.
📘 Follow Along
To run this experiment, you first need to download our 2D image dataset called worldimages. You can find instructions for this here. You will also need to download the pre-trained models. Alternatively, you can run pre-training yourself by running python benchmarks/run.py -e only_surf_agent_training_numenta_lab_obj. Running pre-training requires the Habitat simulator and downloading the numenta_lab 3D mesh dataset.
Analogous to the previous tutorials, you can copy the config above into the benchmarks/configs/my_experiments.py file. You must also add monty_meets_world_2dimage_inference: dict to the MyExperiments class in benchmarks/configs/names.py. Finally, you will need to add the following imports at the top of the my_experiments.py file:
import os
from dataclasses import asdict

import numpy as np

from benchmarks.configs.defaults import (
    default_evidence_1lm_config,
    min_eval_steps,
    pretrained_dir,
)
from benchmarks.configs.names import MyExperiments
from tbp.monty.frameworks.config_utils.config_args import (
    MontyArgs,
    MotorSystemConfigInformedNoTransStepS20,
    ParallelEvidenceLMLoggingConfig,
    PatchAndViewMontyConfig,
)
from tbp.monty.frameworks.config_utils.make_env_interface_configs import (
    EnvInitArgsMontyWorldStandardScenes,
    EvalExperimentArgs,
    WorldImageEnvironmentInterfaceArgs,
    WorldImageEnvironmentInterfaceConfig,
)
from tbp.monty.frameworks.environments import embodied_data as ED
from tbp.monty.frameworks.experiments import MontyObjectRecognitionExperiment

model_path_numenta_lab_obj = os.path.join(
    pretrained_dir,
    "surf_agent_1lm_numenta_lab_obj/pretrained/",
)
To run the experiment, call python benchmarks/run.py -e monty_meets_world_2dimage_inference. If you don't want to log to wandb, add wandb_handlers=[] to the logging_config. If you just want to run a quick test on a few of the images, simply adjust the scenes and versions parameters in the eval_env_interface_args.
If your application uses sensors different from our commonly used cameras and depth sensors, or you want to extract specific features from your sensory input, you will need to define a custom sensor module. The sensor module receives the raw observations from the environment interface and converts them into the CMP, which contains features at poses. For more details on converting raw observations into the CMP, see our documentation on sensor modules.
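As a rough, hedged sketch only (the class shape and method signature here are simplified assumptions for exposition, not Monty's actual SensorModule API), a custom sensor module boils down to "extract features, attach them to a pose":

class MyCustomSM:
    """Sketch of a sensor module that turns raw observations into a
    CMP-style message: features at a pose. This interface is a
    simplification, not Monty's actual sensor module base class."""

    def __init__(self, sensor_module_id, features):
        self.sensor_module_id = sensor_module_id
        self.features = features

    def step(self, raw_obs):
        # Extract whatever features your sensor provides...
        extracted = {f: raw_obs.get(f) for f in self.features}
        # ...and attach them to a location and orientation in space.
        return {
            "location": raw_obs["location"],          # xyz relative to the body
            "pose_vectors": raw_obs["pose_vectors"],  # e.g. surface normal and
                                                      # curvature directions
            "features": extracted,
        }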
If your application requires a specific policy to move through the environment or you have a complex actuator to control, you might want to implement a custom MotorSystem or MotorPolicy class. For more details on our existing motor system and policies, see our documentation on Monty's policies.
Writing those custom classes works the same way as it does for the EnvironmentInterface and EmbodiedEnvironment classes. For general information, see our documentation on customizing Monty.
This tutorial contained more text than practical code because every application is different; we have tried to convey the general principles here. The first step for any application is to think about whether and how the task can be phrased as a sensorimotor environment. What is Monty's action space? How is movement defined? How does it change observations? How do movement and sensation determine the sensor's location and orientation in space? Answering these questions will help you figure out how to define a custom EmbodiedEnvironment and EnvironmentInterface and their associated __init__, step, get_state, reset, pre_episode, and post_episode functions. If you run into issues customizing Monty to your application, please come over to our Discourse Forum and ask for help!
flowchart LR
CPR(Create Pull Request):::contributor --> T(Triage):::maintainer
T --> V{Is valid?}
V -- Yes --> RCP{Checks pass?}
RCP -- No --> UPR(Update Pull Request):::contributor
UPR --> RCP
V -- No --> X(((Reject))):::endFail
RCP -- Yes --> R(Review):::maintainer
R --> C{Needs changes?}
C -- Yes --> RC(Request Changes):::maintainer
RC --> UPR
C -- No --> A(Approve):::maintainer
A --> NBCP
A --> SC(Suggested Commits):::contributor
SC --> NBCP{Checks pass?}
A --> UC(Unexpected Commits):::contributor
UC --> WA(Withdraw Approval):::maintainer
WA --> RCP
NBCP -- No --> UPR2(Update Pull Request):::contributor
UPR2 --> NBCP
NBCP -- Yes --> M(Merge):::maintainer
M --> AMCP{Post-merge<br>checks and tasks<br>pass?}
AMCP -- No --> RV(((Revert))):::endFail
AMCP -- Yes --> D(((Done))):::endSuccess
classDef contributor fill:blue,stroke:white,color:white
classDef maintainer fill:gray,stroke:white,color:white
classDef endFail fill:red,stroke:white,color:white
classDef endSuccess fill:lightgreen,stroke:black,color:black
We use the term heterarchy instead of hierarchy to express the notion that information flow in Monty is not purely feed-forward (+ feedback in reverse order) as in many classical views of hierarchy. Even though we do speak of lower-level LMs and higher-level LMs at times, this does not mean that information strictly flows from layer 0 to layer N in a linear fashion.
First of all, there can be skip connections. A low-level LM, or even an SM, can connect directly to another LM that represents far more complex, high-level models and that might additionally receive input already processed by several other LMs. Therefore, it is difficult to clearly identify which layer an LM is in based on the number of processing steps previously performed on its input. Instead, LMs could be grouped into collections based on who votes with whom, which is defined by whether there is overlap in the objects they model.
Second, we have several other channels of communication in Monty that do not implement a hierarchical forward pass of information (see figure below). An LM can receive top-down input from higher-level LMs (LMs that model objects which are composed of object parts modeled in the receiving LM) as biasing context. Another top-down input to the LM is the goal state, used for modeling hierarchical action policies. Finally, the LM receives lateral inputs from votes and recurrently updates its internal representations.
Lastly, each LM can send motor outputs directly to the motor system. Contrary to the idea that sensory input is processed through a series of hierarchical steps until it reaches the motor area which then produces action plans, we look at each LM as a complete sensorimotor system. Motor output is not exclusive to the end of the hierarchy, but rather occurs at every level of sensory processing.
For these reasons, we call Monty a heterarchical system rather than a hierarchical one. Despite that, we often use terminology as if we had a conventional hierarchical organization, such as top-down and bottom-up input and lower-level and higher-level LMs.
Note
For the neuroscience theory and evidence behind this, see our recent pre-print "Hierarchy or Heterarchy? A Theory of Long-Range Connections for the Sensorimotor Brain".
Connections we refer to as bottom-up connections are connections from SMs to LMs, and connections between LMs that communicate an LM's output (the current most likely object ID and pose) to the main input channel of another LM (the current sensed feature and pose). The output object ID of the sending LM then becomes a feature in the models learned in the receiving LM. For example, the sending LM might be modeling a tire. When the tire model is recognized, it outputs this, together with the recognized location and orientation of the tire relative to the body. The receiving LM would not get any information about the 3D structure of the tire from the sending LM. It would only receive the object ID (as a feature) and its pose. This LM could then model a car, composed of different parts. Each part, like the tire, is modeled in detail in a lower-level LM and then becomes a feature in the higher-level LM's model of the car.
The receiving LM might additionally get input from other LMs and SMs. For example, the LM modeling the car could also receive low-frequency input from a sensor module and incorporate this into its model. This input, however, is usually not as detailed as the input to the LM that models the tire. We do not want to relearn a detailed model of the entire car. Instead, we want to learn detailed models of its components and then compose the components into a larger model. This way, we can also reuse the model of the tire in other higher-level models, such as for trucks, buses, and wheelbarrows.
Top-down connections can bias the hypothesis space of the receiving LM, similar to how votes can do this. They contain a copy of the output from a higher-level LM. For example, if a higher-level LM recognizes a car, this can bias the lower-level LMs to recognize the components of a car. Compared to votes, in this case the sending and receiving LMs do not have models of the same object. Instead, you could compare this to associative connections that are learned through past observed co-occurrence. Essentially, the lower-level LM would learn "Previously, when I received car at this pose as top-down input, I was sensing a tire, so I am also more likely to be observing a tire now". The car might have been recognized before the tire, based on other parts of the car or its rough outline. Importantly, the top-down connection includes not only object ID but also pose information. Overall, top-down input allows for recognizing objects faster given context from a larger scene or object.
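As an illustration of this associative biasing (a hedged sketch with hypothetical names and numbers, not Monty's evidence-update code):

# Hypothetical co-occurrence strengths learned from past episodes:
# receiving "car" as top-down input often coincided with sensing a tire.
learned_cooccurrence = {
    ("car", "tire"): 0.8,
    ("car", "door"): 0.6,
}

def apply_top_down_bias(evidence, top_down_object_id):
    """Boost evidence for objects that co-occurred with the top-down input."""
    for (context, obj), strength in learned_cooccurrence.items():
        if context == top_down_object_id and obj in evidence:
            evidence[obj] += strength
    return evidence

evidence = {"tire": 1.0, "frisbee": 1.0}
print(apply_top_down_bias(evidence, "car"))  # tire: 1.8, frisbee: 1.0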
The 3D environment used for most experiments is Habitat, wrapped into a class called EnvironmentInterface. This class returns observations for given actions.
The environment is currently initialized with one agent that has N sensors attached to it using the PatchAndViewFinderMountConfig. By default, this config has two sensors. The first sensor is the sensor patch that will be used for learning. It is a camera, zoomed in 10x such that it only perceives a small patch of the environment. The second sensor is the view finder, which is at the same location as the patch and moves together with it, but its camera is not zoomed in. The view finder is only used at the beginning of an episode to get a good view of the object (more details in the policy section) and for visualization, but not for learning or inference. The agent setup can also be customized to use more than one sensor patch (such as in TwoLMMontyConfig or FiveLMMontyConfig, see figure below). The configs also specify the type of sensor used, the features that are being extracted, and the motor policy used by the agent.
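For illustration, here is a minimal, hedged sketch of what switching to a multi-patch setup can look like in an experiment config. It assumes FiveLMMontyConfig is importable from the same config_args module as PatchAndViewMontyConfig, and the surrounding experiment config is abbreviated.

# Hedged sketch: swapping in a five-patch Monty config. We assume
# FiveLMMontyConfig lives in config_args alongside PatchAndViewMontyConfig.
from tbp.monty.frameworks.config_utils.config_args import FiveLMMontyConfig

my_five_patch_experiment = dict(
    # ... experiment_class, experiment_args, env interface configs as before ...
    monty_config=FiveLMMontyConfig(),  # five sensor patches + LMs + view finder
)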
Generally, one can also initialize multiple agents and connect them to the same Monty model, but there is currently no policy to coordinate them. The difference between adding more agents and adding more sensors to the same agent is that all sensors connected to one agent have to move together (like neighboring patches on the retina), while separate agents can move independently, like fingers on a hand (see figure below).
The most commonly used environment is an empty space with a floor and one object (although we have started experimenting with multiple objects). The object can be initialized in different rotations, positions, and scales but currently does not move during an episode. For objects, one can either use the default Habitat objects (cube, sphere, capsule, ...) or the YCB object dataset (Calli et al., 2015), containing 77 more complex objects such as a cup, bowl, chain, or hammer, as shown in the figure below. Currently, there is no physics simulation, so objects are not affected by gravity or touch and therefore do not move.
Of course, the EnvironmentInterface classes can also be customized to use different environments and setups, as shown in the table below. We are not even limited to 3D environments and can, for example, let an agent move in 2D over an image, such as when testing on the Omniglot dataset. The only crucial requirement is that we can use an action to retrieve a new, action-dependent observation from which we can extract a pose.
Note
For some examples of how to use Monty with other environments than Habitat (or even in the real world), see this tutorial.
| List of all environment interface classes | Description |
|---|---|
| EnvironmentInterface | Base environment interface class that implements basic iter and next functionalities and episode and epoch checkpoint calls for Monty to interact with an environment. |
| EnvironmentInterfacePerObject | Environment interface for testing on the environment with one object. After each epoch, it removes the previous object from the environment and loads the next object in the pose provided in the parameters. |
| InformedEnvironmentInterface | Environment interface that allows for input-driven, model-free policies (using the previous observation to decide the next action). It implements the find_good_view function and other helpers for the agent to stay on the object surface. It also supports jumping to a target state when driven by model-based policies, although this is a TODO to refactor "jumps" into motor_policies.py. |
| OmniglotEnvironmentInterface | Environment interface that wraps the Omniglot dataset and allows movement along the strokes of the different characters. Has a similar cycle_object structure to EnvironmentInterfacePerObject. |
| List of all environments | Description |
|---|---|
| EmbodiedEnvironment | Abstract environment class. |
| HabitatEnvironment | Environment that initializes a 3D Habitat scene and allows interaction with it. |
| RealRobotsEnvironment | Environment that initializes a real robots simulation (gym) and allows interaction with it. |
| OmniglotEnvironment | Environment for the Omniglot dataset. This is originally a 2D image dataset but reformulated here as stroke movements in 2D space. |
We also have the ObjectBehaviorEnvironment class in the monty_lab repository (object_behaviors/environment.py) for testing moving objects. However, this class still needs to be integrated into the main framework.
During an experiment in the Monty framework, an agent collects a sequence of observations by interacting with an environment. We distinguish between training (internal models are being updated using this sequence of observations) and evaluation (the agent only performs inference using already learned models but does not update them). The MontyExperiment class implements and coordinates this training and evaluation of Monty models.
In reality, an agent interacts continuously with the world, and time is not explicitly discretized. For easier implementation, we use steps as the smallest increment of time. Additionally, we divide an experiment into multiple episodes and epochs for easier measurement of performance. Overall, we discretize time in the three ways listed below.
- Step: taking zero, one, or more actions and receiving one observation. There are different types of steps that track more specifically whether learning-module updates were performed, either individually for each LM or more globally for the Monty class.
  - monty_step (model.episode_steps, total_steps): number of observations sent to the Monty model. This includes observations that were not interesting enough to be sent to an LM, such as off-object observations. It includes both matching and exploratory steps.
  - monty_matching_step (model.matching_steps): at least one LM performed a matching step (updating its possible matches using an observation). There are also exploratory steps, which do not update possible matches and only store an observation in the LM's buffer. These are not counted here.
  - num_steps (lm.buffer.get_num_matching_steps): number of matching steps that a specific LM performed.
  - lm_step (max(num_steps)): number of matching steps performed by the LM that took the most steps.
  - lm_steps_indv_ts (lm.buffer["individual_ts_reached_at_step"]): number of matching steps a specific LM performed until reaching a local terminal state. A local terminal state means that a specific LM has settled on a result (match or no match). This does not mean that the entire Monty system has reached a terminal state, since that usually requires multiple LMs to have reached a local terminal state. For more details, see the section Terminal Condition.
- Episode: putting a single object in the environment and taking steps until a terminal condition is reached, like recognizing the object or exceeding max steps.
- Epoch: running one episode on each object in the training or eval set of objects.
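To show how these levels of discretization nest, here is a hedged pseudocode sketch. The hook names mirror the MontyExperiment methods described further below (pre_epoch, pre_episode, pre_step, and so on), but the control flow is illustrative rather than the exact implementation.

# Hedged sketch of how epochs, episodes, and steps nest in an experiment.
def run_experiment(experiment, n_epochs, objects):
    for _ in range(n_epochs):              # epoch: one episode per object
        experiment.pre_epoch()
        for obj in objects:                # episode: one object in the scene
            experiment.pre_episode()       # puts the next object (obj) in the env
            while not experiment.terminal_condition_reached():
                experiment.pre_step()
                experiment.step()          # zero or more actions, one observation
                experiment.post_step()
            experiment.post_episode()
        experiment.post_epoch()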
In the long term, we might remove the episode and epoch chunking and simply have the agent continuously interact with a given environment without resetting it. Removing the step discretization of time will probably not be possible (maybe with neuromorphic hardware?) but we can simulate continuous time by making the step increments tiny and utilizing the different step types (like only sending an observation to the LM if a significant feature change was detected).
The learning module is designed to learn objects unsupervised, from scratch. We do not assume that it starts with any previous knowledge or even complete objects stored in memory (even though there is an option to load pre-trained models for faster testing). The models in graph memory are therefore updated after every episode, and learning and inference are tightly intertwined. If an object was recognized, the model of this object is updated with new points. If no object was recognized, a new model is generated and stored in memory. This also means that the whole learning procedure is unsupervised, as there are no object labels provided [1].
To keep track of which objects were used for building a graph (since we do not provide object labels in this unsupervised learning setup) we store two lists in each learning module: target_to_graph_id and graph_id_to_target. target_to_graph_id maps each graph to the set of objects that were used to build this graph. graph_id_to_target maps each object to the set of graphs that contain observations from it. These lists can later be used for analysis and to determine the performance of the system but they are not used for learning. This means learning can happen completely unsupervised, without any labels being provided.
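As a toy illustration (object and graph names are hypothetical), the two bookkeeping structures might look like this after a few unsupervised episodes:

# Hypothetical example. The labels are stored only for later analysis;
# they are never used for learning.
target_to_graph_id = {
    "new_object_0": {"mug"},          # this graph was built only from mug episodes
    "new_object_1": {"bowl", "cup"},  # this graph mixed two similar objects
}
graph_id_to_target = {
    "mug": {"new_object_0"},
    "bowl": {"new_object_1"},
    "cup": {"new_object_1"},
}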
There are two modes the learning module can be in: training and evaluation. The two are very similar, as both use the same procedure of moving and narrowing down the list of possible objects and poses. The only difference is that in training mode the models in graph memory are updated after every episode. In practice, we currently often load pre-trained models into the graph memory and then only evaluate these. This avoids the computation of first learning all objects before every evaluation and makes it easier to test on complete, error-free object models. However, it is important to keep in mind that anything that happens during evaluation also happens during training, and that these two modes are almost identical. Were it not for practical reasons (saving computational time), we would never have to run evaluation, as we perform the same operations during training as well. Just as in real life, we want to think of systems as always learning and improving and never reaching a point where they only perform inference.
The training mode is split into two alternating phases: the matching phase and the exploration phase. During the matching phase, the module tries to determine the object ID and pose from a series of observations and actions. This is the same as in evaluation. After a terminal condition is met (object recognized or no possible match found), the module goes into the exploration phase. This phase continues to collect observations and adds them to the buffer the same way as during the previous phase; only the matching step is skipped. The exploration phase is used to add more information to the graph memory at the end of an episode. For example, the matching procedure could be done after three steps, telling us that the past three observations are not consistent with any models in memory. We would therefore want to store a new graph in memory, but a graph made of only three observations is not very useful. Hence, we keep moving for num_exploratory_steps to collect more information about this object before adding it to memory. This is not necessary during evaluation, since we do not update our models then.
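A hedged sketch of this alternation (illustrative method names, not the actual Monty API):

def run_training_episode(lm, env, num_exploratory_steps):
    # Matching phase: narrow down object ID and pose until a terminal
    # condition is reached (object recognized or no possible match).
    while not lm.terminal_condition_reached():
        obs = env.step(lm.propose_action())
        lm.matching_step(obs)  # update possible matches, store obs in buffer
    # Exploration phase: keep collecting observations without matching so
    # that a newly stored graph contains more than a handful of points.
    for _ in range(num_exploratory_steps):
        obs = env.step(lm.propose_action())
        lm.exploratory_step(obs)  # store obs in buffer only
    lm.update_memory()  # extend the recognized graph or add a new one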
| List of all experiment classes | Description |
|---|---|
| MontyExperiment | Abstract experiment class from which all other experiment classes inherit. The experiment class handles the initialization of all Monty components. It also takes care of logging and the highest level calls to train, evaluate, pre_epoch, run_epoch, post_epoch, pre_episode, run_episode, post_episode, pre_step, and post_step. |
| MontyObjectRecognitionExperiment | Experiment class to test object recognition. The current default class for Habitat experiments. Saves the target object and pose for logging. Also contains some custom terminal condition checking and online plotting. |
| MontyGeneralizationExperiment | Same as previous but removes the current target object from the memory of all LMs. Can be used to test generalization to new objects. |
| MontySupervisedObjectPretrainingExperiment | Here we provide the object and pose of the target to the model. This way we can make sure we learn exactly one graph per object in the dataset and use the correct pose when merging graphs. This class is used for model pre-training. |
| DataCollectionExperiment | Just runs exploration and saves the results as a .pt file. No object recognition is performed. |
| ProfileExperimentMixin | Can be added to any experiment class to add a profiler. |
FAQ - Monty
Frequently asked questions about the Thousand Brains Theory, Monty, thousand-brains systems, and the underlying algorithms.
Below are responses to some of the frequently asked questions we have encountered. However, this is not an exhaustive list, so if you have a question, please reach out to us and the rest of the community at our Discourse page. We will also update this page with new questions as they come up.
We use these terms fairly interchangeably (particularly in our meetings); however, the Thousand Brains Theory (TBT) is the underlying theory of how the neocortex works. Thousand-brains systems are artificial intelligence systems designed to operate according to the principles of the TBT. Finally, Monty is the name of the first system that implements the TBT (i.e. the first thousand-brains system) and is made available in our open-source code.
The Free-Energy Principle and Bayesian theories of the brain are interesting and broadly compatible with the principles of the Thousand Brains Theory (TBT). While they can be useful for neuroscience research, our view is that Bayesian approaches are often too broad for building practical, intelligent systems. Where attempts have been made to implement systems with these theories, they often require problematic assumptions, such as modeling uncertainty with Gaussian distributions. While the concept of the neocortex as a system that predicts the nature of the world is common to the Free-Energy Principle and the Thousand Brains Theory (as well as several other ideas going back to Hermann von Helmholtz), we want to emphasize the key elements that set the TBT apart. These include, for example, the use of a modular architecture with reference frames, where each module builds representations of entire objects, and a focus on building a learning system that can model physical objects, rather than on fitting neuroscience data.
One important prediction of the Thousand Brains Theory is that the intricate columnar structure found throughout brain regions, including primary sensory areas like V1 (early visual cortex), supports computations much more complex than extracting simple features for recognizing objects.
To recap a simple version of the model (i.e. simplified to exclude top-down feedback or motor outputs): We assume the simple features that are often detected experimentally in areas like V1 correspond to the feature input (layer 4) in a cortical column. Each column then integrates movement in L6, and uses features-at-locations to build a more stable representation of a larger object in L3 (i.e. larger than the receptive field of neurons in L4). L3’s lateral connections then support "voting", enabling columns to inform each other’s predictions. Some arguments supporting this model are:
i) It is widely accepted that the brain is constantly trying to predict the future state of the world. A column can predict a feature (L4) at the next time point much better if it integrates movement, rather than just predicting the same exact feature, or predicting it based on a temporal sequence - bearing in mind that we can make movements in many directions when we do things like saccade our eyes. Reference frames enable predicting a particular feature, given a particular movement - if the column can do this, then it has built a 3D model of the object.
ii) Columns with different inputs need to be able to work together to form a consensus about what is in the world. This is much easier if they use stable representations, i.e. a larger object in L3, rather than representations that will change moment to moment, such as a low-level feature in L4. Fortunately this role for lateral connections fits well with anatomical studies.
We use a coffee mug as an illustrative example, because a single patch of skin on a single finger can support recognizing such an object by moving over it. With all this said however, we don’t know exactly what the nature of the “whole objects” in the L2/L3 layers of V1 would be (or other primary sensory areas for that matter). As the above model describes, we believe they would be significantly more complex than a simple edge or Gabor filter, corresponding to 3D, statistically repeating structures in the world that are cohesive in their representation over time.
It is also important to note that compositionality and hierarchy are still very important even if columns model whole objects. For example, a car can be made up of wheels, doors, seats, etc., which are distinct objects. The key argument is that a single column can do a surprising amount, more than what would be predicted by artificial neural network (ANN) style architectures, modeling much larger objects than their receptive fields would indicate.
We are focused on ensuring that the first generation of thousand-brain systems are interpretable and easy to iterate upon. Being able to conceptually understand what is happening in the Monty system, visualize it, debug it, and propose new algorithms in intuitive terms is something we believe to be extremely valuable for fast progress. As such, we have focused on the core principles of the TBT, but have not yet included lower-level neuroscience components such as HTM, sparse distributed representations (SDRs), grid-cells, and active dendrites. In the future, we will consider adding these elements where a clear case for a comparative advantage exists.
Part of the underlying Thousand Brains Theory is that the hippocampal formation evolved to enable rapid episodic memory, as well as spatial navigation in animals, with grid cells forming reference frames of environments. Over time, evolution repurposed this approach of modeling the world with reference frames in the form of cortical columns, which were then replicated throughout the neocortex to model objects at any level of abstraction.
As such, we believe that the core computations that a cortical column performs are similar to the hippocampal formation. Since learning modules (LMs) are designed to capture the former, they also implement the core capabilities of the latter. This includes elements such as how objects change as a function of time (analogous to episodic memory), and recognizing specific vs. general instances of objects (analogous to pattern separation vs. completion). The key difference we believe is the time-scale over which learning happens, with the hippocampal complex laying down new information much more rapidly. In Monty, this could be implemented as a high-level LM that adds new information to object models extremely quickly.
It’s worth noting that the speed of learning in LMs is an instance where Monty might be “super-human”, in that within computers, it is easy to rapidly build arbitrary, new associations, while this is challenging for biology. Evolution has required innovations such as silent synapses and an excess of synapses in the hippocampal formation to achieve this ability. This degree of neural hardware to support rapid learning cannot be easily replicated in the compact space of a cortical column, so in biology the latter will always learn much more slowly, and very rapid learning is a more specialized domain of the hippocampal formation.
In the case of Monty, we might choose to enable most LMs to have fairly rapid learning, although we believe there are other reasons that we might restrict this, such as ensuring that low-level representations do not drift (change) too rapidly with respect to high-level ones.
Finally, it's worth noting that there may be other unique features of the hippocampal complex that would require details not found in standard LMs, however our aim is to re-use the implementation of LMs where possible.
If "Cognitive Maps" are Found in the Hippocampal Formation, Why Do We Believe They Are Found in the Neocortex?
Neuroscientists have found evidence that grid cells in the hippocampal complex can encode abstract reference frames, such as the conceptual space of an object's properties (Constantinescu et al, 2016), what is often referred to as "cognitive maps" (O'Keefe and Nadel, 1978). The Thousand Brains Theory does not argue that abstract, structured representations cannot be found in the hippocampal formation. As noted above, this structure has been highly developed so as to enable very rapid learning, which would account for its important role in learning structured representations in mammals. Furthermore, the fact is that this neural hardware exists, so even if cortical columns learn structured representations, it is natural that these would also emerge in the hippocampal formation. This would result in a degree of redundancy between cortical columns and the hippocampal formation, at least early in learning. With this view, evidence for structured representations in the hippocampal formation does not imply that such representations cannot be found in cortical columns.
More generally, we believe the learning that takes place in the reference frames of cortical columns happens more slowly, which would make it more challenging to measure experimentally. If you are an experimental neuroscientist interested in structured representations of objects, we would love to discuss ways that it might be possible to measure the emergence of such representations in the cortical columns.
In short, not at the moment, and it's unclear how essential this would be.
It has been proposed that the ventral and dorsal streams of the visual cortex correspond to "what" vs "where" pathways respectively. In this context, the terms “what” vs. “where” can be misleading, as spatial computation is important throughout the brain, including within cortical columns in the what pathway. This is central to the TBT claim that every column leverages a reference frame, and so "what" should not be interpreted as there being a part of the brain that does not care about spatial relations. Even the alternative naming of the ventral and dorsal streams as a “perception” vs. “action” stream can be misleading, as all columns have motor projections. For example, eye control can be mediated by columns in the ventral stream projecting to the superior colliculus, as well as by other sensory regions.
However one distinction that might exist, at least in the brain, is the following: for columns to meaningfully communicate spatial information with one another, there needs to be some common reference frame. Within a column, the spatial representation is object-centric, but a common reference frame comes into play when different columns interact.
- One choice for this common reference frame is a body-centric coordinate system, which is likely at play in the dorsal ("where") stream. This would explain its importance for directing complex motor actions, as in the classic Milner and Goodale study that spawned the two-stream framing of function.
- An alternative choice is an “allocentric” reference frame, which could be some temporary landmark in the environment, such as the corners of a monitor, or a prominent object in a room. This may be utilized in the ventral ("what") pathway.
In Monty, the between-column computations, such as voting, have made use of an ego/body-centric shared coordinate system. However, this might change in the future, where motor coordination would benefit from egocentric coordinates, and reasoning about object interactions might benefit from allocentric coordinates. If ever implemented, this could be analogous to separate "what" and "where" pathways.
There are deep connections between the Thousand Brains Theory and Simultaneous Localization and Mapping (SLAM), or related methods like particle filters. This relationship was discussed, for example, in Numenta’s 2019 paper by Lewis et al, in a discussion of grid-cells:
“To combine information from multiple sensory observations, the rat could use each observation to recall the set of all locations associated with that feature. As it moves, it could then perform path integration to update each possible location. Subsequent sensory observations would be used to narrow down the set of locations and eventually disambiguate the location. At a high level, this general strategy underlies a set of localization algorithms from the field of robotics including Monte Carlo/particle filter localization, multi-hypothesis Kalman filters, and Markov localization (Thrun et al., 2005).”
This connection points to a deep relationship between the objectives that both engineers and evolution are trying to solve. Methods like SLAM emerged in robotics to enable navigation in environments, and the hippocampal complex evolved in organisms for a similar purpose. One of the arguments of the TBT is that the same spatial processing that supported representing environments was compressed into the 6-layer structure of cortical columns, and then replicated throughout the neocortex to support modeling all concepts with reference frames, not just environments.
Furthermore, Monty's evidence-based learning-module has clear similarities to particle filters, such as its non-parametric approximation of probability distributions. However we have designed it to support specific requirements of the Thousand Brains Theory - properties which we believe neurons have - such as binding information to points in a reference frame.
So in some ways, you can think of the Thousand-Brains Project as leveraging concepts similar to SLAM or particle filters to model all structures in the world (including abstract spaces), rather than just environments. However, it is also more than this. For example, the system's capabilities to model the world and move in it are magnified by the processing of many semi-independent modeling units and the ways in which these units interact.
There are interesting similarities between swarm intelligence and the Thousand Brains Theory. In particular, thousand-brains systems leverage many semi-independent computational units, where each one of these is a full sensorimotor system. As such, the TBT recognizes the value of distributed, sensorimotor processing to intelligence. However, the bandwidth and complexity of the coordination is much greater in the cortex and thousand-brains systems than what could occur in natural biological swarms. For example, columns share direct neural connections that enable them to form compositional relations, including decomposing complex actions into simpler ones.
It might be helpful to think of the difference between prokaryotic organisms that may cooperate to some degree (such as bacteria creating a protective biofilm), vs. the complex abilities of eukaryotic organisms, where cells cooperate, specialize, and communicate in a much richer way. This distinction underlies the capabilities of swarming animals such as bees, which, while impressive, do not match the intelligence of mammals. In the long-term, we imagine that Monty systems can use communicating agents of various complexity, number and independence as required.
Deep learning is a powerful technology - we use large-language models ourselves on a daily basis, and systems such as AlphaFold are an amazing opportunity for biological research. However, we believe that there are many core assumptions in deep learning that are inconsistent with the operating principles of the brain. It is often tempting when implementing a component in an intelligent system to reach for a deep learning solution. However, we have made most conceptual progress when we have set aside the black box of deep learning and worked from basic principles of known neuroscience and the problems that brains must solve.
As such, there may come a time when we leverage deep learning components, particularly for more "sub-cortical" processing such as low-level feature extraction and model-free motor policies (see below); however, we will avoid this until they prove themselves to be absolutely essential. This is reflected in our request that code contributions are in NumPy, rather than PyTorch.
Reinforcement learning (RL) can be divided into two kinds, model-free and model-based. Model-free RL can be used by the brain, for example, to help you proficiently and unconsciously ride a bicycle by making fine adjustments in your actions in response to feedback. Current deep reinforcement learning algorithms are very good at this (Mnih et al, 2015). However, when you learned to ride a bicycle, you likely watched your parents give a demonstration, listened to their explanation, and had an understanding of the bicycle's shape and the concept of pedaling before you even started moving on it. Without these deliberate, guided actions, it could take thousands of years of random movement in the vicinity of the bicycle until you figured out how to ride it, as positive feedback (the bicycle is moving forward) is rare.
All of these deliberate, guided actions you took as a child were "model-based", i.e. dependent on models of the world. These models are learned in an unsupervised manner, without reward signals. Mammals are very good at this, as demonstrated by Tolman's classic experiments with rats in the 1940s. However, how to learn and then leverage these models in deep reinforcement learning is still a major challenge. For example, part of DeepMind's success with AlphaZero (Silver et al, 2018) was the use of explicit models of game-board states. However, for most things in the world, these models cannot be added to a system like the known structure of a Go or chess board, but need to be learned in an unsupervised manner.
While this remains an active area of research in deep-reinforcement learning (Hafner et al, 2023), we believe that the combination of 3D, structured reference frames with sensorimotor loops will be key to solving this problem. In particular, thousand brains systems learn (as the name implies) thousands of semi-independent models of objects through unsupervised, sensorimotor exploration. These models can then be used to decompose complex tasks, where any given learning module can propose a desired "goal-state" based on the models that it knows about. This enables tasks of arbitrary complexity to be planned and executed, while constraining the information that a single module needs to learn about the world. Finally, the use of explicit reference frames increases the speed at which learning takes place, and enables planning arbitrary sequences of actions. Like Tolman's rats, this is similar to how you can navigate around a room depending on what obstacles there are, such as an office chair that has been moved, without needing to learn it as a specific sequence of movements.
In the long term, there may be a role for something like deep-reinforcement learning to support the model-free, sub-cortical processing of thousand-brains systems. However, the key open problem, and the one that we believe the TBT will be central to, is unlocking the model-based learning of the neocortex.
Attempts to explain how back-propagation might exist in the brain require problematic assumptions, such as 1:1 associations between neurons that are not observed experimentally. Furthermore, systems reliant on back-propagation display undesirable learning characteristics, such as catastrophic forgetting, and the requirement for large amounts of training data.
In Monty, we make use of associative learning, together with a strong spatial inductive bias, to enable rapid learning without the use of back-propagation. Although we do not currently use modifiable "weights", one way to conceptualize this is as a form of Hebbian learning in a Hopfield (associative memory) network, where memories are embedded in reference frames, rather than a single, homogeneous population of neurons.
As we introduce hierarchy with potentially distant dependencies, there are a few properties of the Thousand Brains Theory that should make learning less dependent on long-range credit assignment as compared to traditional neural networks:
- Columns throughout the brain (and therefore Monty learning modules) have direct sensory input and motor output, and are therefore able to build predictive models of the world using information locally available to them. As such, a significant amount of learning involves a very flat hierarchy.
- The vast majority of learning is based on predicting the nature of the world in an unsupervised manner, i.e. it is a very dense learning signal, unlike trying to learn sensorimotor systems end-to-end using reinforcement learning.
- The use of a strong, spatial inductive bias significantly reduces the number of samples required to build a representation that can generalize.
It is also interesting to consider how learning occurs in humans when "credit-assignment" is required. For example, long-distance, sparse associations can occur after achieving a reward at the end of a complex task. However, learning in this setting involves the use of explicit, causal world models to understand what helped - for example, "I flipped a particular switch, which opened the door, and then I was able to climb onto the platform. Based on my causal knowledge of the world, the fact that I was humming a tune while I did this probably did not contribute to my success."
Such explicit credit-assignment is not limited to reinforcement learning. When you learn a new, compositional object, you develop a representation by iteratively learning shallow levels of composition, and building upon these. This is why it is important when children learn to read that they first learn to recognize the individual letters, which are composed of strokes. Once letters are recognized, children can learn how letters form words, and so on. This learning does not take place by showing children blocks of text, and expecting them to passively develop representations in a deep hierarchy through a form of error propagation.
This is different from the implicit, model-free credit assignment that neural networks use. It is also why, even with biologically implausible error transport (i.e. back-prop for credit-assignment), deep learning models need to train on orders of magnitude more data than humans to achieve comparable performance.
We believe that there is limited evidence that deep learning systems, including generative pre-trained transformers (GPTs) and diffusion models, can learn sufficiently powerful "world models" for true machine intelligence. For example, representations of objects in deep learning systems tend to be highly entangled and divorced from concepts such as cause-and-effect (see e.g. Brooks, Peebles et al, 2024), in comparison to the object-centric representations that are core to how humans represent the world even from an extremely young age (Spelke, 1990). Representations are also often limited in structure, manifesting in the tendency of deep learning systems to classify objects based on texture more than shape (Gavrikov et al, 2024), an entrenched vulnerability to adversarial examples (Szegedy et al, 2013), the tendency to hallucinate information, and the idiosyncrasies of generated images (such as inconsistent numbers of fingers on hands), when compared to the simpler, but much more structured drawings of children.
Instead, these systems appear to learn complex input-output mappings, which are capable of some degree of composition and interpolation between observed points, but limited generalization beyond the training data. This makes them useful for many tasks, but requires training on enormous amounts of data, and limits their ability to solve benchmarks such as ARC-AGI, or more importantly, make themselves very useful when physically embodied. This dependence on input-output mappings means that even approaches such as chain-of-thought or searching over the space of possible outputs (e.g. the recent o1 models), are more akin to searching over a space of learned "Type 1" actions, rather than the true "Type 2" (Stanovich and West, 2000), model-based thinking that is a marker of intelligence.
How Does Hierarchical Composition Relate to the Hierarchical Features in Convolutional Neural Networks?
In CNNs, and deep-learning systems more generally, there is often a lack of “object-centric” representations, which is to say that when processing a scene with many objects, the properties of these tend to be mixed up with one another. This is in contrast to humans, where we understand the world as being composed of discrete objects with a degree of permanence, and where these objects have the ability to interact with one another.
Furthermore, any given object in our brain is represented spatially, where the shape of the object - i.e. the relative arrangement of features - is far more important than low-level details like a texture that might correlate with a particular object class. This is how we see a person in the famous Vertumnus painting by Arcimboldo, despite all the local features being pieces of fruit. Again, this is different from how CNNs and other deep-learning systems learn to represent objects.
Importantly, such object-centric and spatially-structured representations do not just exist at one level of abstraction, but throughout various levels of hierarchy, from how you understand something like an iris or a fingernail, all the way up to your representation of a person. This continues to extend upwards to abstract concepts like society, where representations continue to be discrete, structured objects.
So while there is hierarchy in both CNNs and the human visual system, the former can be thought of more as a bank of filters that detect things like textures and other correlations between input pixels and output labels. We believe that in the brain, every level of the hierarchy represents discrete objects with their own structure and associated motor policies. These can be rapidly composed and recombined, enabling a wide range of representations and behaviors to emerge.
Concepts from symbolic AI have some similarities to the Thousand Brains Theory, including the importance of discrete entities, and mapping how these are structurally related to one another. However, we believe that it is important that representations are grounded in a sensorimotor model of the world, whereas symbolic approaches typically begin at high levels of abstraction.
However, the approach we are adopting contrasts to some "neuro-symbolic" approaches that have been proposed. In particular, we are not attempting to embed entangled, object-impoverished deep-learning representations within abstract, symbolic spaces. Rather, we believe that object-centric representations using reference frames should be the representational substrate from the lowest-level of representations (vision, touch) all the way up to abstract concepts (languages, societies, mathematics, etc.). Such a commonality in representation is consistent with the re-use of the same neural hardware (the cortical column) through the human neocortex, from sensory regions to higher-level, "cognitive" regions.
Can Monty be Used for a Scientific Problem, Like Mapping the Genetic Sequence of a Protein to its 3D Structure?
This depends a lot on how data is available. Given a static dataset of genetic sequences and their mapping onto the 3D structures of their proteins, Monty is not going to work well, while this is where a function-approximation algorithm like deep-learning can excel.
Where Monty would eventually shine is when it is able to control experimental devices that allow it to further probe information about the world, i.e. a sensorimotor environment. For example, Monty might have some hypotheses about a structure and want to test these through various experiments, probing the space in which it is uncertain. We embed representations in 3D coordinates that can take on any kind of graph structure that fits in <= 3D space (strings, graphs defined by edges, or 3D point clouds), so in theory these kinds of entities can all be represented.
How Monty would learn to generalize a mapping between these levels of representation remains an open question, and relates to how Monty can learn a mapping between different spaces (e.g. meeting a new family and mapping these people onto an abstract family-tree structure). We are still figuring out exactly how this would work in a simpler case like the family tree. In the protein case, the rules are much more complex, so learning this is definitely not something that Monty can do now. However, just like a human scientist, the ultimate aim would be for Monty to learn to do this mapping based on a causal understanding of the world, informed by the above-mentioned experiments, in contrast to end-to-end, black-box function approximation.
You can also read more about applications of Monty under the suggested criteria.
The implementation of Monty contains all the basic building blocks that were described in the documentation section. To recap, the basic components of Monty are: Sensor modules (SM) to turn raw sensory data into a common language; learning modules (LM) to model incoming streams of data and use these models for interaction with the environment; motor system(s) to translate abstract motor commands from the learning module into the individual action space of agents; and an environment in which the system is embedded and which it tries to model and interact with. The components within the Monty model are connected by the Cortical Messaging Protocol (CMP) such that basic building blocks can be easily repeated and stacked on top of each other. Any communication within Monty is expressed as features at poses (relative to the body), where features can also be seen as object IDs and poses can be interpreted in different ways. For example, a pose sent to the motor system is a target to move to, a pose sent to another LM's input channel is the most likely pose, and poses sent over the vote connections are all possible poses. All these elements are implemented in Python in the git repository https://github.com/thousandbrainsproject/tbp.monty and will be described in detail in the following sections.
The classes in the Monty code base implement the abstract concepts described above. Each basic building block of Monty has its own customizable abstract class. Additionally, we have an experiment class that wraps around all the other classes and controls the experiment workflow to test Monty.
The main testbed for Monty is currently focused on object and pose recognition. This also involves learning models of objects and interacting with the environment but it all serves the purpose of recognizing objects and their poses. In the future, this focus might shift more towards the interaction aspect where recognizing objects is mostly fulfilling the purpose of being able to meaningfully interact with the environment.
This video gives a high-level overview of all the components in Monty and the custom versions of them that we currently use for our benchmark experiments. It goes into some more depth on the evidence-based learning module (our current best LM implementation) and voting.
📺 [2023/01 - A Comprehensive Overview of Monty and the Evidence-Based Learning Module](https://www.youtube.com/watch?v=bkwY4ru1xCg)
Learning modules are the core modeling system of Monty. They are responsible for learning models from the incoming data (either from a sensor module or another learning module). Their input and output formats are features at a pose. Using the displacement between two consecutive poses, they can learn object models of features relative to each other and recognize objects that they already know, independent of where they are in the world. How exactly this happens is up to each learning module and we have several different implementations for this.
Generally, each learning module contains a buffer, which functions as a short-term memory, and some form of long-term memory that stores models of objects. Both can then be used to generate, update, and communicate hypotheses about what is currently being sensed. If certainty about a sensed object is reached, information from the buffer can be processed and integrated into the long-term memory. Finally, each learning module can also receive and send target states, using a goal state generator, to guide the exploration of the environment.
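To make this structure concrete, below is a minimal, hypothetical skeleton of these pieces; the class and method names are illustrative and do not match the actual classes in tbp.monty.

```python
# Hypothetical skeleton of a learning module; names are illustrative only.

class SketchLearningModule:
    def __init__(self):
        self.buffer = []          # short-term memory: this episode's observations
        self.object_models = {}   # long-term memory: object_id -> learned model
        self.hypotheses = {}      # current hypotheses about object ID and pose

    def step(self, features_at_pose):
        """Receive CMP input (features at a pose) and update hypotheses."""
        self.buffer.append(features_at_pose)
        self.update_hypotheses(features_at_pose)

    def update_hypotheses(self, observation):
        """Compare the observation against stored models (implementation-specific)."""
        ...

    def consolidate(self, object_id):
        """On recognizing an object, integrate the buffer into long-term memory."""
        self.object_models.setdefault(object_id, []).extend(self.buffer)
        self.buffer = []
```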
The details of specific graph LM implementations, approaches, results, and problems are beyond the scope of this overview document; they are written out in a separate document here. Other approaches that we tried but discontinued can be found in our monty_lab repository.
| List of all learning module classes | Description |
|---|---|
| LearningModule | Abstract learning module class. |
| GraphLM | Learning module that contains a graph memory class and a buffer class. It also has properties for logging the target and detected object and pose, and contains functions for calculating displacements, updating the graph memory, and logging. This class is not used on its own but is the superclass of DisplacementGraphLM, FeatureGraphLM, and EvidenceGraphLM. |
| DisplacementGraphLM | Learning module that uses the displacements stored in graph models to recognize objects. |
| FeatureGraphLM | Learning module that uses the locations stored in graph models to recognize objects. |
| EvidenceGraphLM | Learning module that uses the locations stored in graph models to recognize objects and keeps a continuous evidence count for all its hypotheses. |
Since we currently focus on learning modules that use 3D object graphs, we will explain the workings of Monty in some more detail using this example. Using explicit 3D graphs makes visualization more intuitive and makes the system more transparent and easier to debug. This does not mean that we think the brain stores explicit graphs! We are using graphs while we are nailing down the framework and the Cortical Messaging Protocol; this is just one possible embodiment of a learning module. How we represent graphs inside a learning module has no effect on the other components of Monty or its general principles. Explicit graphs are the most concrete type of model that uses reference frames, and they have helped us think through a lot of related problems so far. In the future, we may move away from explicit graphs towards something more like HTM in combination with grid-cell-like mechanisms, but for now, they help us understand the problems and possible solutions better.
The evidence-based LM is currently the default LM used for benchmarking the Monty system. We will therefore go into a bit more detail on this in Evidence Based Learning Module. The other approaches listed above are not under active development. DisplacementGraphLM and FeatureGraphLM are maintained but do not support hierarchy.
There are currently three flavors of graph matching implemented: Matching using displacements, matching using features at locations, and matching using features at locations but with continuous evidence values for each hypothesis instead of a binary decision. They all have strengths and weaknesses but are generally successive improvements. They were introduced sequentially as listed above and each iteration was designed to solve problems of the previous one. Currently, we are using the evidence-based approach for all of our benchmark experiments.
Displacement matching has the advantage that it can easily deal with translated, rotated, and scaled objects and recognize them without additional computations for reference frame transforms. If we represent the displacement in a rotation-invariant way (for example as point pair features), the recognition performance is not influenced by the rotation of the object. For scale, we can simply apply a scaling factor to the length of the displacements, which we can calculate from the difference in length between the first sensed displacement and the stored displacements of the initial hypotheses (assuming we sample a displacement that is stored in the graph, which is a strong assumption). It is the only LM that can deal with scale at the moment. The major downside of this approach is that it only works if we sample the same displacements that are stored in the graph model of the object, and the number of possible displacements grows combinatorially with the size of the graph.
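For illustration, the classic point pair feature is one such rotation-invariant encoding of a displacement between two points with surface normals. This is a minimal sketch of the general idea, not the exact representation used in Monty:

```python
import numpy as np

def point_pair_feature(p1, n1, p2, n2):
    """Rotation-invariant encoding of the displacement from p1 to p2,
    given unit surface normals n1 and n2 (illustrative sketch)."""
    d = p2 - p1
    dist = np.linalg.norm(d)
    d_unit = d / dist
    # These angles are unchanged by any rigid rotation of the whole object.
    angle = lambda a, b: np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    return dist, angle(n1, d_unit), angle(n2, d_unit), angle(n1, n2)
```

To handle scale, the first component (the displacement length) could additionally be divided by the estimated scaling factor described above.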
Feature matching addresses this sampling issue of displacement matching by instead matching features at nearby locations in the learned model. The problem with this approach is that locations are not invariant to the rotation of the reference frame of the model. We, therefore, have to cycle through different rotations during matching and apply them to the displacement that is used to query the model. This however is more computationally expensive.
Both previous approaches use a binary approach to eliminate possible objects and poses. This means that if we get one inconsistent observation, the hypothesis is permanently eliminated from the set of possible matches. The evidence-based LM deals with this issue by assigning a continuous evidence value to each hypothesis which is updated with every observation. This makes the LM much more robust to noise and new sampling. Since the set of hypotheses retains the same size over the entire episode we can also use more efficient matrix multiplications and speed up the recognition procedure. The evidence count also allows us to have a most likely hypothesis at every step, even if we have not converged to a final classification yet. This is useful for further hierarchical processing and action selection at every step.
Overall, matching with displacements can deal well with rotated and scaled objects but fails when sampling new displacements on the object. Feature matching does not have this sampling issue but instead requires a more tedious search through possible rotations and scale is an open problem. Evidence matching uses the mechanisms of feature matching but makes them more robust by using continuous evidence counts and updating the evidence with efficient matrix multiplications.
The Monty class contains everything an agent needs to model and interact with the environment. It contains (1) sensor modules, (2) learning modules (sensorimotor learning systems), and (3) a motor system for taking action (see here). The Monty class manages the communication between these components using the Cortical Messaging Protocol (see here).
A Monty instance can be arbitrarily customized as long as it implements the handful of abstract methods listed below.
A Monty class should have:

- step and voting methods, which define the modeling/action logic and the communication logic, respectively
- methods for saving, loading, and logging
- methods called at event markers like pre-episode and post-episode
Below are the arguments associated with the Monty classes:

- a list of SensorModule instances, each of which is responsible for processing raw sensory input and transforming it into a canonical format that any LearningModule can operate on
- a list of LearningModule instances, each of which is responsible for building models of objects given outputs from a sensor module
- a dictionary mapping sensors to an agent (`sm_to_agent_dict`)
- a matrix describing the coupling from SensorModule to LearningModule (`sm_to_lm_matrix`)
- a matrix describing the coupling between LearningModules used for voting (`lm_to_lm_vote_matrix`)
- a motor system responsible for moving the agent
- the Monty class used (`monty_class`) and its arguments (`monty_args`)
Using the above arguments, we can specify the structure underlying our modeling system. For instance, if we have five sensors in the environment, we would specify five sensor modules, each corresponding to one sensor (we often use an additional sensor module connected to a view-finder sensor which does not connect to a learning module). Each sensor module would be connected to one learning module, and the connections between learning modules are specified in `lm_to_lm_matrix`. The modeling in this Monty instance could then look as shown in the figure below.
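As a schematic illustration of how these arguments fit together for the five-sensor example, consider the following sketch. The keys mirror the argument names from the list above, but the exact configuration classes and index formats in tbp.monty differ, so treat this as illustrative wiring rather than a working config:

```python
# Five patch sensors plus a view finder; one agent carries all sensors.
monty_config = dict(
    sensor_modules=["sm_0", "sm_1", "sm_2", "sm_3", "sm_4", "view_finder"],
    learning_modules=["lm_0", "lm_1", "lm_2", "lm_3", "lm_4"],
    # All sensors are mounted on the same agent.
    sm_to_agent_dict={f"sm_{i}": "agent_0" for i in range(5)},
    # One sensor module feeds one learning module; the view finder feeds none.
    sm_to_lm_matrix=[[i] for i in range(5)],
    # Fully connected voting between all learning modules (excluding self).
    lm_to_lm_vote_matrix=[[j for j in range(5) if j != i] for i in range(5)],
)
```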
| List of all Monty classes | Description |
|---|---|
| Monty | Abstract Monty class from which all others inherit. Defines step method framework and other communication elements like voting. |
| MontyBase | Implements all the basic functionalities for stepping LMs and SMs and routing information between Monty components. |
| MontyForGraphMatching | Implements custom step and vote functions adapted for graph-based learning modules. Also adds custom terminal conditions for object recognition and determining possible matches. |
| MontyForEvidenceGraphMatching | Customizes previous class with a voting function designed for the evidence-based LM. Also customizes motor suggestions to use evidence-based models. |
Before sending information to the sensor module, which extracts features and poses, we can apply transforms to the raw input. Possible transforms are listed in the table below. Transforms are applied to all sensors in an environment before observations are sent to the SMs and are specified in the environment interface arguments.
| List of all transform classes | Description |
|---|---|
| MissingToMaxDepth | Habitat depth sensors return 0 when no mesh is present at a location; this transform returns `max_depth` instead. |
| AddNoiseToRawDepthImage | Adds Gaussian noise to the raw sensory input. |
| DepthTo3DLocations | Transform semantic and depth observations from camera coordinates (2D) into agent (or world) coordinates (3D). Also estimates whether we are on the object or not if no semantic sensor is present. |
The transformed raw input is then sent to the sensor module and turned into the CMP-compliant format. The universal format that all sensor modules output is features at a pose in 3D space. Each sensor connects to a sensor module which turns the raw sensory input into this format of features at locations. Each input therefore contains the x, y, z coordinates of the feature location relative to the body and three orthonormal vectors indicating its rotation. In sensor modules, these pose-defining vectors are given by the surface normal and principal curvature directions sensed at the center of the patch; in learning modules, the pose vectors are defined by the detected object rotation. Additionally, the sensor module returns the sensed pose-independent features at this location (e.g. color, texture, curvature, ...). The sensed features can be modality-specific (e.g. color for vision or temperature for touch) while the pose is modality-agnostic.
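For illustration, the three orthonormal pose vectors can be assembled from a sensed surface normal and first principal curvature direction roughly as follows (a hypothetical helper, not Monty's actual feature extraction code):

```python
import numpy as np

def pose_vectors(surface_normal, curvature_direction):
    """Build a 3x3 orthonormal pose from the sensed surface normal and the
    first principal curvature direction (illustrative sketch)."""
    n = surface_normal / np.linalg.norm(surface_normal)
    # Project the curvature direction into the tangent plane of the surface.
    c1 = curvature_direction - np.dot(curvature_direction, n) * n
    c1 = c1 / np.linalg.norm(c1)
    # The second curvature direction completes the right-handed frame.
    c2 = np.cross(n, c1)
    return np.stack([n, c1, c2])  # rows are the three orthonormal pose vectors
```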
| List of all sensor module classes | Description |
|---|---|
| SensorModule | Abstract sensor module class. |
| HabitatSM | Sensor module for HabitatSim. Extracts pose and features in CMP format from an RGBD patch. Keeps track of agent and sensor states. Also checks if observation is on object and should be sent to LM. Can be configured to add feature noise. |
| Probe | A probe that can be inserted into Monty in place of a sensor module. It will track raw observations for logging, and can be used by experiments for positioning procedures, visualization, etc. What distinguishes a probe from a sensor module is that it does not process observations and does not emit a Cortical Message. |
Each sensor module accepts `noise_params`, which configure the DefaultMessageNoise that adds feature and location noise to the created Cortical Message (State) before sending. Feature and location noise can be configured individually.

Each sensor module also accepts `delta_thresholds`, which configure a FeatureChangeFilter that may set the State's `use_state` attribute to False if the sensed features did not change significantly between subsequent observations. Significance is defined by the `delta_thresholds` parameter for each feature.
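As a rough illustration, such sensor module arguments might look as follows. The parameter structure shown here is a guess for illustration purposes only; consult the benchmark configs in tbp.monty for the exact format:

```python
# Hypothetical sensor module arguments (keys and value shapes are illustrative).
sensor_module_args = dict(
    # Magnitude of the Gaussian noise applied per feature and to the location.
    noise_params={
        "features": {"hsv": 0.1},
        "location": 0.002,  # in meters
    },
    # Minimum change per feature for an observation to count as significant.
    delta_thresholds={
        "distance": 0.01,
        "hsv": [0.1, 0.1, 0.1],
    },
)
```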
For an overview of which type of data processing belongs where, please refer to the following rules:

- Any data transforms that apply to the entire dataset or to all SMs can be placed in the `dataset.transform`.
- Any data transforms that are specific to a sensory modality belong in the SM. For example, an auditory SM could include a preprocessing step that computes a spectrogram, followed by a convolutional neural network that outputs feature maps.
- Transforming locations in space from coordinates relative to the sensor to coordinates relative to the body happens in the SM.
- The output of both SMs and LMs is a State class instance containing information about pose relative to the body and detected features at that pose. This is the input that any LM expects.
In this implementation, some features are extracted using all of the information in the sensor patch (e.g. locations of all points in the patch for surface normal and curvature calculation) but then refer to the center of the patch (e.g. only the curvature and surface normal of the center are returned). At the moment all the feature extraction is predefined but in the future, one could also imagine some features being learned.
Each Sensor Module needs to extract a pose from the sensory input it receives. This pose can be defined by the surface normal and the two principal curvature vectors. These three vectors are orthogonal to each other, where the surface normal is the vector perpendicular to the surface and pointing away from the object, and the two principal curvature vectors point in the directions of the greatest and least curvature of the surface.
We can use the surface normal (previously referred to as point-normal) and principal curvature to define the orientation of the sensor patch. The following video, Surface Normals and Principle Curvatures, describes what these represent.
The Cortical Messaging Protocol is defined in the State class. The output of every SM and LM is an instance of the State class, which makes sure it contains all required information. The required information stored in a State instance is:

- location (relative to the body)
- morphological features: pose_vectors (3x3 orthonormal), pose_fully_defined (bool), on_object (bool)
- non-morphological features: color, texture, curvature, ... (dict)
- confidence (in [0, 1])
- use_state (bool)
- sender_id (unique string identifying the sender)
- sender_type (string in ["SM", "LM"])
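A minimal sketch of such a message as a Python dataclass is shown below; the field names follow the list above, but the actual State class in tbp.monty is more elaborate:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class StateSketch:
    """Illustrative stand-in for the CMP State message."""
    location: np.ndarray               # (3,) location relative to the body
    morphological_features: dict       # pose_vectors (3x3), pose_fully_defined, on_object
    non_morphological_features: dict   # e.g. {"hsv": [...], "curvature": ...}
    confidence: float                  # in [0, 1]
    use_state: bool                    # whether receivers should use this message
    sender_id: str                     # unique string identifying the sender
    sender_type: str                   # "SM" or "LM"
```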
The State class is quite general, and depending on who outputs it, it can be interpreted in different ways. As the output of a sensor module, it can be seen as the observed state. When output by a learning module, it can be interpreted as the hypothesized or most likely state. When it is the motor output of the LM, it can be seen as a goal state (for instance specifying the desired location and orientation of a sensor or object in the world). Lastly, when sent as lateral votes between LMs, we send a list of State class instances which can be interpreted as all possible states (where states do not contain non-morphological, modality-specific features but only pose information associated with object IDs).
Open Questions

Below is a simple outline of some of the open questions that we are currently exploring.
For more details, we also encourage checking out our Future Work Roadmap and related sections, where we go into some possible approaches to these questions.
- How are object behaviors represented?
- How are they recognized?
- Where do we have a model of general physics? Can every LM learn the basic physics necessary for the objects it models (e.g. fluid-like behavior in some, cloth-like behavior in others)? Or are some LMs more specialized for this?
- How are they represented?
- How are they recognized?
- How is an object recognized irrespective of its scale?
- How do we know the scale of an object?
- How does it work that I can learn an object using vision but then recognize it using touch? (in Monty the Sensor Module can simply be switched out but how would it work in the brain? Or how would it work in a multimodal Monty system without rewiring the SM to LM connections or explicitly copying models?)
- Should there be some form of memory consolidation?
- How do we learn where an object begins and where it ends? What defines a sub-component of an object?
- How are goal states decomposed into sub-goals?
- How does the motor system decide which goal state to carry out (given that every/many LMs produce goal states)?
- What is the routing mechanism?
- How does attention come into play here?
- Can we assume that models of the same object in different LMs were learned using the same reference frame/in the same orientation?
The policy is generated by the motor system and outputs actions within the agent's action space. If there are multiple agents in an environment, each will have its own policy and may have a different action space.
We currently have two broad types of action spaces for our agents. The first is a camera-like action space for what we call the distant agent, due to its physical separation from the surface of the object it is sensing. Like an eye, the action space is that of a "ball-and-socket" actuator that can look in different directions. In addition, the "socket" itself can move if directed by the hypothesis-driven policy (more on that in a moment).
The second action space is a more flexible one that allows easier movement through space, but is designed to move along the surface of an object (e.g. like a finger), hence we call it the surface agent (see figure below).
For both the distant agent and surface agent, the action space also includes the ability to make instantaneous "jumps" in the absolute coordinates of the environment. These are used by the hypothesis-driven action policy to evaluate candidate regions of an object, with the aim of rapidly disambiguating the ID or pose of the current object.
Given these action spaces, the distant agent remains in one location unless moved by the hypothesis-driven policy, and can tilt its sensors up, down, left, and right to explore the object from one perspective. The surface agent moves closely along the surface of the object by always staying aligned with the sensed surface normal (i.e. facing the object surface head-on). It can therefore efficiently move around the entire object and theoretically reach any feature on the object even without the hypothesis-driven action policy.
Note that the differences between the agents and action spaces are in some sense arbitrary at this point; however, it is useful to explore inference with both of these approaches, owing to some unique properties that they might each afford:
- Distant agent:
  - Rapid movements: being far from the surface of an object affords the ability to make more rapid movements, without risking knocking over or colliding with the object; this might manifest as rapid saccades across the key points of an object.
  - Broad overview: by being distant, an agent can acquire a broader overview of the structure and potential points of interest on an object, informing areas to rapidly test.
  - Simplicity: a distant agent is easier to implement in practice, as it does not require a fine-tuned policy that can move along an object's surface without colliding with it or leaving the surface. Furthermore, the action space is generally easier to realize, e.g. it does not require a complex supporting body for something like a "finger" moving in the real world.
- Surface agent:
  - Additional modalities: all of the sensory modalities of the distant agent are limited to forms of propagating waves, in particular electromagnetic radiation (such as visible light) and sound waves; in contrast, the surface agent has the potential to feel physical textures directly, and since cameras are easy to come by (unlike biological eyes...), we can easily include one in our surface agent.
  - Fidelity: related to the above, one would imagine that the depth or temperature readings of a finger-like sensor exploring an object will be more accurate than those of a distant agent relying on electromagnetic radiation alone; furthermore, the movements that constrain a surface agent to the surface may provide an additional signal for accurately estimating the curvature of objects.
  - Interactivity: by being close to an object, "surface agent" components of a Monty system are also likely able to interact with the object, affording additional capabilities.
  - Object enclosure: by using an array of 3D sensors that are still relatively nearby (such as the fingers on a hand), surface agents can more easily surround an object from all sides and instantaneously reduce the object self-occlusion that otherwise affects distant-agent observations. By moving on the surface of an object, they also significantly reduce the risk of other objects occluding the object of interest, which makes it easier to isolate the object and detect when moving onto a new object.
Before an experiment starts, the agent is moved to an appropriate starting position relative to the object. This serves to set up the conditions desired by the human operator for the Monty agent, and is analogous to a neurophysiologist lifting an animal to place it in a particular location and orientation in a lab environment. As such, these are considered positioning procedures, in that they are not driven by the intelligence of the Monty system, although they currently make use of its internal action spaces. Furthermore, sensory observations that occur during the execution of a positioning procedure are not sent to the learning module(s) of the Monty system, as these procedures have access to privileged information, such as a wider field-of-view camera. Two such positioning procedures exist, one for the distant agent (GetGoodView) and one for the surface agent (touch_object).
For the former, the distant agent is moved to a "good view" such that small and large objects in the environment cover approximately a similar space in the camera image (see figure below). To determine a good view we use the view-finder which is a camera without zoom and which sees a larger picture than the sensor patch. Without GetGoodView, small objects such as the dice may be smaller than the sensor patch, thereby preventing any movement of the sensor patch on the object (without adjusting the action amount). For large objects, there is a risk that the agent is initialized inside the object as shown in the second image in the first row of the below figure.
touch_object serves a similar purpose - the surface agent is moved sufficiently close such that it is essentially on the surface of the object. This will be important in future work when the surface agent has access to sensory inputs, such as texture, that require maintaining physical contact with an object.
Some more details on the positioning procedures are provided below.
- This is called by the distant agent in the pre-episode period, and makes use of the view-finder. To estimate whether it is on an object, it can either make use of the semantic-sensor (which provides ground-truth information about whether an object is in view and its label), or it can approximate this information using a heuristic based on depth-differences.
- Information in the view-finder is used to orient the view-finder, and the associated sensor-patch(es) onto the object, before moving closer to the object as required.
- Contains additional logic for handling multiple objects, in particular making sure the agent begins on the target object in an experiment where there are distractor objects. This is less relevant for the surface agent, as currently multi-object experiments are only for the distant agent.
- Key parameters that determine the behavior of the positioning procedure are `good_view_percentage` and `desired_object_distance`. The primary check of the algorithm compares `perc_on_target_obj` (the percent of the view-finder's visual field that is filled by the object) against the desired `good_view_percentage`. `closest_point_on_target_obj` simply serves to ensure we don't get too close to the object.
- This can be called by the surface agent when determining the next action (even within an episode), and makes use of the view-finder, but not the semantic-sensor.
- Contains a search-loop function that will orient around to find an object, even if it is not visible in the view-finder.
- The key parameter is `desired_object_distance`, which reflects the effective length of the agent and its sensors as it moves along the surface of the object.
As the positioning procedures were implemented in the early development of Monty, there are a number of aspects which we plan to adjust in the near future, and which should be reflected in any new positioning procedures.
- Positioning procedures should only be called in `pre_episode`, i.e. before the episode begins.
- Positioning procedures may make use of privileged information such as the view-finder and semantic-sensor, which are not available to the learning module. However, consistent with point (1), these should not be leveraged during learning or inference by the Monty agent.
These requirements are currently not enforced in the use of GetGoodView and touch_object when an agent jumps to a location using a model-based policy. Similarly, touch_object is used by the surface agent if it loses contact with the object and cannot find it. Appropriately separating out the role of positioning procedures via the above requirements will clarify their role, and enable policies that do not make use of privileged information, but which serve similar purposes during an experiment.
We can divide policies into two broad categories: input-driven policies, which use data directly from the sensor to determine the next action, and hypothesis-driven policies, which use goal states from the learning module to determine the next action. Besides these, we can also use a largely random policy that is influenced by neither. The input-driven policies are a bit more simplistic but can already have powerful effects, while the hypothesis-driven policies can actively make use of the learned object models and current hypotheses to quickly narrow down the object ID and enable a more efficient exploration of an object. Importantly, input- and hypothesis-driven policies can work in concert for maximal efficiency.
Input-driven policies are based on sensory input from the SM. They can be compared to reflexive behaviors, where the agent reacts to the current sensory input without processing it further or using models of the world to decide what to do. This direct connection between SM and motor policy currently does not enforce the CMP but might in the future. The hypothesis-driven policies are based on goal states that are output by the goal state generator of the learning modules. These goal states follow the CMP and therefore contain a pose and features. They are interpreted as target poses of objects that should be achieved in the world and get translated into motor commands by the motor system.
In the current default setup, observations of the distant agent are collected by random movement of the camera. The only input-driven influence is that if the sensor patch moves off the object, the previous action is reversed to make sure we stay on the object. Policies with random elements also have a momentum parameter (alpha) which regulates how likely it is to repeat the previous action and allows for more directional movement paths. If more predictable behavior is desired, pre-defined action sequences can also be specified; these are frequently employed in unit tests.
The surface agent can either use a random walk policy (again with an optional momentum parameter to bias following a consistent path), or alternatively make use of the "curvature-informed" policy. This policy makes use of the sensed principal curvature directions (where they are present), attempting to follow these, such as on the rim of a cup or the handle of a mug. The details of this policy are expanded upon further below.
Finally, both the distant and surface agent can make use of the hypothesis-driven action policy.
The policies mentioned above are aimed at efficient inference. There is also a specialized policy that can be used to ensure good object coverage when learning about new objects called NaiveScanPolicy. This policy starts at the center of the object and moves outwards on a spiral path. This policy makes the most sense to use with the MontySupervisedObjectPretraining Experiment and is written for the distant agent. For hypothesis-driven policies during exploration, one would want to go to locations on the object that are not well represented in the model yet (not implemented). Exploration policies generally aim at good object coverage and exploring new areas of an object while inference policies aim at efficiently viewing unique and distinguishing features that are well represented in the object model.
Examples of different policies in action
| List of all policy classes | Description |
|---|---|
| MotorPolicy | Abstract policy class. |
| BasePolicy | Policy that randomly selects actions from the action space. The switch_frequency parameter determines how likely it is to sample a new action compared to repeating the previous action. There is also an option to provide a file with predefined actions that are executed in that order. |
| InformedPolicy | Receives the current observation as input and implements basic input-driven heuristics. This policy can search for the object if it is not in view, move to the desired distance to an object, and make sure to stay on the object by reversing the previous action if the sensor moves off the object. |
| NaiveScanPolicy | Policy that moves in an outwards spiral. Is used for supervised pretraining with the distant agent to ensure even object coverage. |
| SurfacePolicy | Moves perpendicular to the object's surface at a constant distance using the surface normals. This is also a type of InformedPolicy. |
| SurfacePolicyCurvatureInformed | Custom SurfacePolicy that follows the direction of the two principal curvatures and switches between them after a while. |
| HypothesisDrivenPolicyMixin | The hypothesis-driven policy is implemented as a mixin, such that it can be adopted by a variety of other core policies, such as the informed distant-agent policy, or the surface-agent, curvature-informed policy. |
Due to their relative complexity, the below sub-sections provide further detail on the curvature-informed and hypothesis-driven policies.
The curvature-informed surface-agent policy is designed to follow the principal curvatures on an object's surface (where these are present), so as to efficiently explore the relevant "parts" of an object. For example, this policy should be more likely than a random walk with momentum to efficiently explore components such as the rim of a cup, or the handle of a mug, when these are encountered.
To enable this, the policy is informed by the principal curvature information available from the sensor module. It will alternate between following the minimal curvature (e.g. the rim of a cup) and the maximal curvature (e.g. the curved side of a cylinder). The decision process that guides the curvature-informed policy is elaborated in plain English in the chart below, and in more coding terminology in Figure TODO.
🚧 TODO: add a figure for coding-terminology chart
Avoidance of previous locations: In addition to being guided by curvature, the policy has the ability to check whether the direction in which it is heading will bring it back to a previously visited location; if so, it will attempt to find a new heading that is more likely to explore a new region of the object. Some further points on this are:
- We typically limit the temporal horizon over which the agent performs this check, such that it won't worry about revisiting a location seen in the distant past. Furthermore, the angle (i.e. projected cone) over which we consider a previous location as being on our current heading-path progressively decreases as we iteratively search for a new direction. In other words, the more time we spend looking for a new heading, the more tolerant we are to passing nearby previously visited locations. Both of these elements help reduce the chance of getting stuck in situations where the agent is surrounded by previously visited locations and cannot find a path out.
- To further limit the risk of getting stuck in such encircled regions, once the agent has selected an avoidance heading, it will maintain it for a certain period of time before checking for conflicts again.
- Note that in the future, we may wish to sometimes intentionally return to previous locations when we have noisy self-movement information, so as to correct our estimated path-integration.
In the figure below, two examples of the policy in action are shown.
For the hypothesis-driven action policy, the Monty system uses its learned, internal models of objects, in combination with its hypotheses about the current ID and pose of the object it is perceiving, to propose a point in space to move to. This testing point should help disambiguate the ID and/or pose of the object as efficiently as possible. This policy is currently for LMs that use explicit 3D graphs to model objects.
Selecting a Test Point: A simple heuristic for where to test an object for this purpose might be to evaluate the point on the most-likely object that is farthest away from where the system currently believes it is. Such a "max-distance" condition is partially implemented; however, the main method currently in use involves a moderately more sophisticated comparison of the two most likely hypotheses, so as to identify the most distinguishing point to visit.

In particular, this "graph-mismatch" technique takes the most likely and second most likely object graphs and, using their most likely poses, overlays them in a "mental" space. It then determines, for every point in the most likely graph, how far away the nearest neighbor in the second graph is. The output is the point in the first graph that has the most distant nearest neighbor (figure below).
As an example, if the most likely object was a mug, and the second most likely object a can of soup, then a point on the handle of the mug would have the most-distant nearest neighbor to the can-of-soup graph. In general, this is a reasonable heuristic for identifying potentially diagnostic differences in the structures of two objects using Euclidean distance-alone, without requiring comparisons in feature-space. Eventually we want to extend this to include distance in feature space to efficiently distinguish objects with the same morphology such as a coke can from a sparkling water can.
In addition to being able to compare the top two most likely objects in such a way, one can just focus on the most likely object, and use this same technique to compare the two most likely poses of the same object.
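The core of the graph-mismatch computation can be sketched in a few lines, assuming both graphs are given as point arrays already transformed into the shared "mental" space by their most likely poses (the function below is illustrative, not Monty's implementation):

```python
import numpy as np
from scipy.spatial import cKDTree

def most_distinguishing_point(graph1_points, graph2_points):
    """Return the point in graph 1 whose nearest neighbor
    in graph 2 is farthest away (illustrative sketch)."""
    distances, _ = cKDTree(graph2_points).query(graph1_points)
    return graph1_points[np.argmax(distances)]
```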
Determining When to Perform a Hypothesis-Driven Test: There are currently a handful of conditions for when we perform a hypothesis-driven test, summarized below:
- The system is sufficiently confident about the object ID in question; in this case, Monty will focus on determining pose as described above. With all of the other conditions below, the process remains focused on distinguishing the ID amongst the top two most likely objects.
- The IDs or order of the top two most-likely object IDs change (e.g. we go from (1: fork, 2: spoon) to (1: spoon, 2: fork)).
- The pose of the most likely object has changed.
- There has been a significant number of steps without performing a hypothesis-driven jump.
The above conditions can support performing a hypothesis-driven jump, but in addition, it is first necessary that we have taken a sufficient number of steps since our last hypothesis-driven jump, to reduce the probability of continuously jumping over the object, potentially to similar locations. In practice, however, this minimum number of steps is small (e.g. 5 steps).
It is worth emphasizing that the long-term view of improving the policy in Monty will certainly leverage various advanced learning techniques, including reinforcement learning and/or evolutionary algorithms. The policies we have implemented so far are therefore not meant to cover every possible scenario an agent might face. Rather, they are intended as a reasonable set of primitives that a more advanced policy might make use of, without having to learn them from the ground up.
It is also worth noting that these same primitives will likely prove useful as we move from 3D physical space, to more abstract spaces. For example, in such settings, we might still want to have the inductive biases that support following dimensions of maximal or minimal variation, or use the differences between internal models to determine test-points.
Note
This section covers features that are already available in Monty but are still being evaluated by our researchers and are currently not part of our recommended default Monty setup.
📺 [2023/01 - A Comprehensive Overview of Monty and the Evidence-Based Learning Module](https://www.youtube.com/watch?v=bkwY4ru1xCg)
We explain the evidence-based learning module in detail here since it is the current default LM. However, a lot of the following content applies to all graph-based LMs or LMs in general.
The next sections will go into some more detail on how the hypothesis initialization and update works in the evidence LM. We use the movement on a colored cylinder shown in the figure below as an example.
At the first step of an episode we need to initialize our hypothesis space. This means, we define which objects and poses are possible.
At the beginning of an episode, we consider all objects in an LM's memory as possible. We also consider any location on these objects as possible. We then use the sensed pose features (surface normal and principal curvature direction) to determine the possible rotations of the object. This is done for each location on the object separately, since we would have different hypotheses of the object orientation depending on where we assume we are. For example, the rotation hypothesis from a point on the top of the cylinder is 180 degrees different from a hypothesis on the bottom of the cylinder (see figure below, top).
By aligning the sensed surface normal and curvature direction with the ones stored in the model we usually get two possible rotations for each possible location. We get two since the curvature direction has a 180-degree ambiguity, meaning we do not know if it points up or down as we do with the surface normal.
For some locations, we will have more than two possible rotations. This is the case when the first principal curvature (maximum curvature) is the same as the second principal curvature (minimum curvature) which happens for example when we are on a flat surface or a sphere. If this is the case, the curvature direction is meaningless and we sample N possible rotations along the axis of the surface normal.
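As a sketch, such rotation hypotheses could be sampled by spinning a base alignment about the surface normal (an illustrative helper using SciPy, not Monty's actual code):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def sample_rotations_about_normal(base_rotation, surface_normal, n=8):
    """Sample n rotation hypotheses spaced evenly around the (unit-length)
    surface normal, for use when the curvature directions are undefined."""
    angles = np.deg2rad(np.linspace(0, 360, n, endpoint=False))
    # Rotation vectors whose direction is the normal and magnitude the angle.
    spins = Rotation.from_rotvec(angles[:, None] * surface_normal)
    return [spins[i] * base_rotation for i in range(len(spins))]

# Example: 8 hypotheses around the z-axis, starting from the identity.
hypotheses = sample_rotations_about_normal(
    Rotation.identity(), np.array([0.0, 0.0, 1.0]), n=8)
```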
After initializing the hypothesis space we assign an evidence count to each hypothesis. Initially, this is 0 but if we are also observing pose-independent features such as color or the magnitude of curvature we can already say that some hypotheses are more likely than others (see figure below, bottom). We do not exclude hypotheses based on the feature since we might be receiving noisy data.
To calculate the evidence update, we take the difference between the sensed features and the stored features in the model. At any location in the model where this difference is smaller than the tolerance value set for this feature, we add evidence to the associated hypotheses in proportion to the quality of the match. Generally, we never use features to subtract evidence, only to add evidence. Therefore, if the feature difference is larger than the tolerance (as in the blue and flat parts of the cylinder model in the figure above), no additional evidence is added. The feature difference is also normalized such that we add a maximum of 1 to the evidence count for a perfect match and 0 evidence if the difference is larger than the set tolerance. With the `feature_weights` parameter, certain features can be weighted more than others.
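In sketch form, the per-feature evidence update described above might look like this (a hypothetical helper; Monty additionally handles multiple features, circular quantities like hue, and per-feature weights):

```python
import numpy as np

def feature_evidence(sensed, stored, tolerance, weight=1.0):
    """Evidence in [0, weight]: 0 if the difference exceeds the tolerance,
    rising linearly to the full weight for a perfect match.
    Features never subtract evidence."""
    diff = np.linalg.norm(np.asarray(sensed, dtype=float)
                          - np.asarray(stored, dtype=float))
    if diff > tolerance:
        return 0.0
    return weight * (1.0 - diff / tolerance)
```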
In the example shown in the figure below, we initialize two pose hypotheses for each location stored in the model except on the top and bottom of the cylinder where we have to sample more because of the undefined curvature directions. Using the sensed features we update the evidence for each of these pose hypotheses. In places of the model where both color and curvature match, the evidence is the highest (red). In places where only one of those features matches, we have a medium-high evidence (yellow) and in areas where none of the features match we add 0 evidence (grey).
At every step after the first one we will be able to use a pose displacement for updating our hypotheses and their evidence. At the first step we had not moved yet so we could only use the sensed features. Now we can look at the difference between the current location and the previous location relative to the body and calculate the displacement.
The relative displacement between two locations can then be used in the model's reference frame to test hypotheses. The displacement only concerns the location, while its orientation is still expressed in the body's reference frame. To test hypotheses about different object rotations, we therefore have to rotate the displacement accordingly. We take each hypothesis location as a starting point and rotate the displacement by the hypothesis rotation. The endpoint of the rotated displacement is the new possible location for this hypothesis. It basically says: "If I had been at location X on the object, the object was in orientation Y, and I moved with displacement D, then I would now be at location Z." All of these locations and rotations are expressed in the object's reference frame.
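This update is compact enough to sketch directly, using SciPy's Rotation for the hypothesis rotation (an illustrative helper, not Monty's vectorized implementation):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def search_location(hyp_location, hyp_rotation, displacement):
    """Rotate the body-frame displacement into the object's reference frame
    under the hypothesized rotation, then step from the hypothesized location."""
    return hyp_location + hyp_rotation.apply(displacement)

# "If I had been at X, in orientation Y, and moved by D, I would now be at Z."
z = search_location(
    np.array([0.0, 0.0, 0.0]),                   # X: hypothesized location
    Rotation.from_euler("z", 90, degrees=True),  # Y: hypothesized rotation
    np.array([0.01, 0.0, 0.0]),                  # D: sensed displacement
)
```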
Each of these new locations now needs to be checked, and the information stored in the model at this location needs to be compared to the sensed features. Since the model only stores discrete points, we often do not have an entry at the exact search location, so we look at the nearest neighbors instead (see the next section for more details).
We now use both morphology and features to update the evidence. Morphology includes the distance of the search location to nearby points in the model and the difference between sensed and stored pose features (surface normal and curvature direction). If there are no points stored in the model near the search location then our hypothesis is quite wrong and we subtract 1 from the evidence count. Otherwise, we calculate the angle between the sensed pose features and the ones stored at the nearby nodes. Depending on the magnitude of the angle we can get an evidence update between -1 and 1 where 1 is a perfect fit and -1 is a 180-degree angle (90 degrees for the curvature direction due to its symmetry). For the evidence update from the features, we use the same mechanism as during initialization where we calculate the difference between the sensed and stored features. This value can be between 0 and 1 which means at any step the evidence update for each hypothesis is in [-1, 2].
The evidence value from this step is added to the previously accumulated evidence for each hypothesis. The previous evidence and current evidence are both weighted and if their weights add up to 1 the overall evidence will be bounded to [-1, 2]. If they add up to more than 1 the evidence can grow infinitely. The current default values are 1 for both weights such that the evidence grows and no past evidence gets forgotten. However, in the future bounded evidence might make more sense, especially when the agent moves from one object to another and past evidence should not be relevant anymore. Whether bounded or not, the evidence value of a hypothesis expresses the likelihood of a location and rotation hypothesis given the past sequence of displacements and features.
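The accumulation rule itself is a simple weighted sum, sketched below with descriptive parameter names (the defaults follow the text; the actual parameter names in Monty may differ):

```python
def accumulate_evidence(past_evidence, step_evidence,
                        past_weight=1.0, current_weight=1.0):
    """With weights (1, 1), evidence grows without bound; if the weights
    sum to at most 1, the total stays bounded to the per-step range [-1, 2]."""
    return past_weight * past_evidence + current_weight * step_evidence
```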
At the next step, the previous step's search locations become the new location hypotheses. This means that the next displacement starts where the previous displacement ended, given the hypothesis.
Evidence updates are performed for all objects in memory. This can be done in parallel since the updates are independent of each other.
For more details, you can have a look at this deep dive into the evidence-based learning module:

📺 [2025/02 - Deep Dive on the EvidenceLM](https://www.youtube.com/watch?v=VKWhQJzNpks)
As mentioned above, we look at the nearest neighbors of each search location, since the object model only stores discrete locations. To do this we use a k-d tree search and retrieve the `max_nneighbors` nearest neighbors.

To make sure that the nearest neighbors are actually close to the search location, we threshold them using the `max_match_distance` parameter. A nearest neighbor that is further away than this distance is not considered. This defines the search radius.
When calculating the evidence update for a given hypothesis we first calculate the evidence for all points in the search radius. Then we use the best match (the point with the highest evidence) to update the overall evidence of this hypothesis. This makes sure that we do not introduce noise when features in a model change quickly. As long as there is a good match stored in the model near the search location we add high evidence. It does not matter if the model also contains points that do not match so well but are also nearby.
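A rough sketch of this search, using SciPy's k-d tree (the helper and its `model_evidence_fn` callback are hypothetical; Monty performs this for all hypotheses at once):

```python
import numpy as np
from scipy.spatial import cKDTree

def best_match_evidence(model_points, model_evidence_fn, search_location,
                        max_nneighbors=3, max_match_distance=0.01):
    """Query up to max_nneighbors stored points, drop those beyond
    max_match_distance, and keep the evidence of the best match
    (or -1 if nothing is close enough to the search location)."""
    dists, idx = cKDTree(model_points).query(search_location, k=max_nneighbors)
    close = np.atleast_1d(idx)[np.atleast_1d(dists) <= max_match_distance]
    if close.size == 0:
        return -1.0  # hypothesis is off the model: subtract evidence
    return max(model_evidence_fn(i) for i in close)
```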
To incorporate some knowledge about the surface into this search, we do not use a uniform (spherical) search radius but instead inform it with the sensed surface normal. The idea is that we want to search further in the directions perpendicular to the surface normal (along the surface) than in the direction of the surface normal (off the surface). This is visualized in the figure below.
We modulate the influence of the surface normal (or the flatness of the search sphere) by the sensed curvature. If we sense a curvature close to 0 we are on a flat surface and want to have a flat search radius. If we sense a curvature with high magnitude we want to use an almost circular search radius.
To implement these notions, we use a custom distance measure that takes the surface normal and curvature magnitude into account. This distance is then used for thresholding and leads to considering more points along the surface and fewer off the surface (see the figure above, bottom row).
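One way such a distance measure could look is sketched below: the component of the offset along the surface normal is stretched when the surface is flat, so thresholding with a fixed radius yields a flattened search region. The exact weighting used in Monty differs, and `flatness_scale` is an illustrative parameter:

```python
import numpy as np

def surface_informed_distance(offset, surface_normal, curvature_magnitude,
                              flatness_scale=10.0):
    """Distance from a search location to a stored point (offset = point -
    search_location), penalizing the off-surface component more strongly
    the flatter the sensed surface is (illustrative sketch)."""
    n = surface_normal / np.linalg.norm(surface_normal)
    along_normal = np.dot(offset, n)          # component off the surface
    in_plane = offset - along_normal * n      # component along the surface
    # Curvature ~ 0 (flat): strong off-surface penalty -> flat search region.
    # High curvature: factor ~ 1 -> nearly spherical search region.
    factor = 1.0 + flatness_scale / (1.0 + abs(curvature_magnitude))
    return np.sqrt(np.dot(in_plane, in_plane) + (factor * along_normal) ** 2)
```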
As mentioned before, features can only add evidence, not subtract it. Morphology (location and pose feature match) can add and subtract evidence. This is because we want to be able to recognize objects even when features are different. For example, if we have a model of a red coffee mug and are presented with a blue one we would still want to recognize a coffee mug.
The idea is that features can add evidence to make recognition faster, but they cannot make a hypothesis less likely. This is only partially achieved right now: since we consider relative evidence values, features that add evidence for some hypotheses but not others implicitly make the latter less likely, which can remove them from the possible matches.
A future solution could be to store multiple possible features or a range of features at the nodes. Alternatively, we could separate object models more and have one model for morphology which can be associated with many feature maps (kind of like UV maps in computer graphics). This is still an area of active conceptual development.
We use continuous evidence values for our hypotheses but for some outputs, statistics, and the terminal condition we need to threshold them. This is done using the x_percent_threshold parameter. This parameter basically defines how confident we need to be in a hypothesis to make it our final classification and move on to the next episode.
The threshold is applied in two places: To determine possible matches (object) and possible poses (location and rotation). In both cases, it works the same way. We look at the maximum evidence value and calculate x percent of that value. Any object or pose that has evidence larger than the maximum evidence minus x percent is considered possible.
The larger the x-percent threshold is set, the more certain the model has to be in its hypothesis to reach a terminal state. This is because the terminal state checks whether there is only one possible hypothesis. If we, for instance, set the threshold at 20%, there cannot be another hypothesis with an evidence count above the most likely hypothesis's evidence minus 20%.
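In sketch form (an illustrative helper over a flat evidence array, not Monty's implementation):

```python
import numpy as np

def possible_hypotheses(evidence, x_percent_threshold=20):
    """Indices of all hypotheses whose evidence lies within
    x percent of the maximum evidence value."""
    evidence = np.asarray(evidence, dtype=float)
    max_ev = evidence.max()
    return np.flatnonzero(evidence > max_ev - (x_percent_threshold / 100) * max_ev)
```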
Besides possible matches and possible poses, we also have the most likely hypothesis. This is simply the hypothesis with the maximum evidence over all poses. The most likely hypothesis within one object defines this object's overall evidence, and the most likely hypothesis overall (the output of the LM) is the one with the maximum evidence value over all objects and poses.
Finally, when an object has no hypothesis with a positive evidence count, it is not considered a possible match. If all objects have only negative evidence we do not know the object we are presented with and the LM creates a new model for it in memory.
Voting can help to recognize objects faster as it helps integrate information from multiple matches. Generally, the learning module is designed to be able to recognize objects on its own simply through successive movements. With voting we can achieve flash inference by sharing information between multiple learning modules. Note that even though the example in the figure below shows voting between two surface-based sensors, voting also works across modalities. This is because votes only contain information about possible objects and their poses which is modality agnostic.
At each step, after an LM has updated its evidence given the current observation, the LM sends out a vote to all its connected LMs. This vote contains its pose hypotheses and the current evidence for each hypothesis. The evidence values are scaled to [-1, 1], where -1 is the currently lowest evidence and 1 the highest. This makes sure that LMs that received more observations than others do not outweigh them. We can also transmit only part of the votes by setting the `vote_evidence_threshold` parameter. For instance, if this value is set to 0.8, only votes with a scaled evidence above 0.8 are sent out. This can dramatically reduce runtime.
The votes get transformed using the displacement between the sensed input poses. This displacement also needs to be rotated by the pose hypothesis to be in the model's reference frame. We assume that the models in both LMs were learned at the same time and are therefore in the same reference frame. If this does not hold, the reference transform between the models would also have to be applied here.
Once the votes are in the receiving LM's reference frame, the receiving LM updates its evidence values. To do this, it again looks at the nearest neighbors to each hypothesis location, but this time the nearest neighbors among the votes. The distance-weighted average of votes in the search radius (between -1 and 1) is added to the hypothesis evidence.
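A minimal sketch of this update is shown below, assuming the votes have already been transformed into the receiving LM's reference frame and scaled to [-1, 1]; all names are illustrative.

```python
import numpy as np
from scipy.spatial import KDTree

def update_evidence_with_votes(hyp_locations, hyp_evidence,
                               vote_locations, vote_values, search_radius):
    """Add the distance-weighted average of nearby votes to each hypothesis."""
    tree = KDTree(vote_locations)
    for i, loc in enumerate(hyp_locations):
        neighbors = tree.query_ball_point(loc, r=search_radius)
        if not neighbors:
            continue  # no votes near this hypothesis location
        dists = np.linalg.norm(vote_locations[neighbors] - loc, axis=1)
        weights = 1.0 - dists / search_radius  # closer votes count more
        hyp_evidence[i] += np.average(vote_values[neighbors], weights=weights)
    return hyp_evidence
```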
Note
See our docs for more details on reference frame transformations in Monty.
In our current experiment setup, we divide time into episodes. Each episode ends when a terminal state is reached. In the object recognition task, this is either no match (the model does not know the current object and we construct a new graph for it), match (we recognized the object as corresponding to a graph in memory), or time out (we took a maximum number of steps without reaching one of the other terminal states). This means that each episode contains a variable number of steps, and after every step we need to check whether a terminal condition was met. This check is summarized in the figure below (except for time out, which is checked separately in the code).
This check can be divided into global Monty checks and local LM checks (purple box). An individual LM can reach its terminal state much earlier than the overall Monty system. For example, if one LM does not have a model of the shown object, it will reach no match quickly while the other LMs continue until they recognize the object. Or one LM might receive some unique evidence about an object and recognize it before all other LMs do. The episode only ends once min_lms_match LMs have reached the terminal 'match' state.
A learning module can have three types of output at every step. All three outputs are instances of the State class and adhere to the Cortical Messaging Protocol.
The first one is, just like the input, a pose relative to the body and features at that pose. This would for instance be the most likely object ID (represented as a feature) and its most likely pose. This output can be sent as input to another learning module or be read out by the experiment class for determining Monty's Terminal Condition and assessing the model performance.
The second output is the LM's vote. If the LM received input at the current step, it can send its current hypotheses and their likelihoods to other LMs that it is connected to. For more details on how this works in the evidence LM, see the section Voting with evidence.
Finally, the LM can also suggest an action in the form of a goal state. This goal state can then either be processed by another learning module and split into subgoals or by the motor system and translated into a motor command in the environment. The goal state follows the CMP and therefore contains a pose relative to the body and features. The LM can for instance suggest a target pose for the sensor it connects to that would help it recognize the object faster or poses that would help it learn new information about an object. A goal state could also refer to an object in the environment that should be manipulated (for example move object x to location y or change the state of object z). To determine a good target pose, the learning module can use its internal models of objects, its current hypotheses, and information in the short-term memory (buffer) of the learning module. The goal state generator is responsible for the selection of the target goal state based on the higher-level goal state it receives and the internal state of the learning module. For more details, see our hypothesis driven policies documentation.
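For orientation, here is a simplified, hypothetical stand-in for such a CMP message. The actual State class in the Monty code base has its own fields and conventions; treat all names below purely as illustration of the kind of information each output carries.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SimplifiedState:
    """Toy stand-in for a CMP-compliant message (not the real State class)."""
    location: np.ndarray                # pose translation, relative to the body
    orientation: np.ndarray             # pose rotation (e.g., as a quaternion)
    features: dict = field(default_factory=dict)  # e.g., {"object_id": "mug"}
    confidence: float = 1.0             # how certain the sender is
    use_state: bool = True              # whether the receiver should use it
    sender_id: str = "LM_0"             # which SM/LM produced this message
```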
First, a few words on memory in learning modules in general. Each LM has two types of memory: a short-term memory ("buffer") which stores recent observations, and a long-term memory which stores the structured object models.
Each learning module has a buffer class which can be compared to a short-term memory (STM). The buffer only stores information from the current episode and is reset at the start of every new episode. Its content is used to update the graph memory at the end of an episode. (The current setup assumes that only one object is explored during an episode; otherwise, the buffer would have to be reset more often.) The buffer is also used to retrieve the location observed at the previous step for calculating displacements. Additionally, the buffer stores some information for logging purposes.
Each learning module has one graph memory class which it uses as a long term memory (LTM) of previously acquired knowledge. In the graph learning modules, the memory stores explicit object models in the form of graphs (represented in the ObjectModel class). The graph memory is responsible for storing, updating, and retrieving models from memory.
Object models are stored in the graph memory and typically contain information about one object. The information they store is encoded in reference frames and contains poses relative to each other and features at those poses. There are currently two types of object models:
Unconstrained graphs are instances of the GraphObjectModel class. These encode an object as a graph with nodes and edges. Each node contains a pose and a list of features. Edges connect each node to its k nearest neighbors and contain a displacement and a list of features. More information on graphs will be provided in the next sections.
Constrained graphs are instances of the GridObjectModel class. These models are constrained by their size, resolution, and complexity. Unconstrained graphs can contain an unlimited number of nodes, which can be arbitrarily close to or far from each other. Constrained graphs ensure that the learned models are efficient by enforcing low-resolution models for large objects, such as a house, and high-resolution models for small objects, such as a die. This is more realistic and forces Monty to learn compositional objects, which leads to more efficient representations of the environment, a higher representational capacity, faster learning, and better generalization to new objects composed of known parts. The three constraints on these models are applied to the raw observations from the buffer to generate a graph which can then be used for matching in the same way as the unconstrained graph. More information on constrained graphs will be provided in the following sections.
A graph is constructed from a list of observations (poses, features). Each observation can become a node in the graph which in turn connects to its nearest neighbors in the graph (or by temporal sequence), indicated by the edges of the graph. Each edge has a displacement associated with it which is the action that is required to move from one node to the other. Edges can also have other information associated with them, for instance, rotation invariant point pair features (Drost et al., 2010). Each node can have multiple features associated with it or simply indicate that the object exists there (morphology). Each node must contain location and orientation information in a common reference frame (object centric with an arbitrary origin).
See the video in this section for more details: Surface Normals and Principal Curvatures
Similar nodes in a graph (those with no significant difference in pose or features to an existing node) are removed (see above figure), and nodes can have a variety of features attached to them. Removing similar points from a graph helps us to be more efficient when matching and avoids storing redundant information in memory. This way, we store more points where features change quickly (like where the handle attaches to the mug) and fewer points where features are not changing as much (like on a flat surface).
When using constrained graphs, each learning module has three parameters that constrain the size (max_size), resolution (num_voxels_per_dim), and complexity (max_nodes) of models it can learn. These parameters do not influence what an LM sees; they only influence what it will store in memory. For example, if an LM with a small maximum model size is moving over a large object, it will perceive every observation on the object and try to recognize it. However, it will not be able to store a complete model of the object. It might know about subcomponents of the object if it has seen them in isolation before and send those to other LMs that can model the entire large object (usually at a lower resolution). Once we let go of the assumption that each episode only contains one object, we also do not need to see the subcomponents in isolation anymore to learn them. We would simply need to recognize them as separate objects (for example because they move independently) and they would be learned as separate models.
To generate a constrained graph, the observations that should be added to memory are first sorted into a 3D grid. The first observed location will be put into the center voxel of the grid and all following locations will be sorted relative to this one. The size of each voxel is determined by the maximum size of the grid (in cm) and the number of voxels per dimension. If more than 10% of locations fall outside of the grid, the object cannot be added to memory.
After all observations are assigned to a voxel in the grid, we retrieve three types of information for each voxel:
- The number of observations in the voxel.
- The average location of all observations in the voxel.
- The average features (including pose vectors) of all observations in the voxel.
Finally, we select the k voxels with the highest observation count, where k is the maximum number of nodes allowed in the graph. We then create a graph from these voxels by turning each of the k voxels into a node in the graph and assigning the corresponding average location and features to it.
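Putting these steps together, here is an illustrative sketch of turning raw observations into constrained-graph nodes. It is not the actual GridObjectModel implementation, and all names are made up for the example.

```python
import numpy as np

def constrained_graph_nodes(locations, features, max_size,
                            num_voxels_per_dim, max_nodes):
    """Bin observations into a voxel grid and keep the max_nodes fullest voxels."""
    voxel_size = max_size / num_voxels_per_dim
    center = num_voxels_per_dim // 2
    # The first observed location defines the center voxel of the grid.
    idx = np.floor((locations - locations[0]) / voxel_size).astype(int) + center
    inside = np.all((idx >= 0) & (idx < num_voxels_per_dim), axis=1)
    if (~inside).mean() > 0.1:
        raise ValueError("More than 10% of locations fall outside the grid.")
    voxels = {}  # per-voxel summary statistics: count, location sum, feature sum
    for i in np.flatnonzero(inside):
        count, loc_sum, feat_sum = voxels.get(tuple(idx[i]), (0, 0.0, 0.0))
        voxels[tuple(idx[i])] = (count + 1, loc_sum + locations[i],
                                 feat_sum + features[i])
    # Keep the max_nodes voxels with the highest observation counts; each
    # becomes a node with the voxel's average location and features.
    top = sorted(voxels.values(), key=lambda v: v[0], reverse=True)[:max_nodes]
    return [(loc_sum / n, feat_sum / n) for n, loc_sum, feat_sum in top]
```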
When updating an existing constrained object model, the new observations are added to the existing summary statistics. Then the new k-winner voxels are picked to construct a new graph.
The three grids used to represent the summary statistics (middle in figure above) are represented as sparse matrices to limit their memory footprint.
We can use graphs in memory to predict if there will be a feature sensed at the next location and what the next sensed feature will be, given an action/displacement (forward model). This prediction error can then be used for graph matching to update the possible matches and poses.
A graph can also be queried to provide an action that leads from the current feature to a desired feature (inverse model). This can be used for a goal-conditioned action policy and more directed exploration. To do this we need to have a hypothesis of the object pose.
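As a sketch of the forward-model case, the snippet below predicts the feature sensed after a displacement by querying the graph node nearest to the predicted location; it returns None when we predict moving off the object. Names are illustrative, and the real matching also compares poses and turns the prediction error into evidence updates.

```python
import numpy as np
from scipy.spatial import KDTree

def predict_next_feature(node_locations, node_features,
                         current_location, displacement, max_dist):
    """Forward model: predict the feature at the displaced location (sketch)."""
    predicted_location = current_location + displacement
    dist, idx = KDTree(node_locations).query(predicted_location)
    # If no stored node is close enough, we predict sensing nothing (off-object).
    return node_features[idx] if dist <= max_dist else None
```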
In a sensorimotor learning setup, one naturally encounters several different reference frames in which information can be represented. Additionally, Monty has internal reference frames in which it learns models of objects. Those can get confusing to wrap your head around and keep track of, so here is a brief overview of all the reference frames involved in a typical Monty setup.
The orientation of the object in the world (unknown to Monty). For instance, if we use a simulator, this would be the configuration that specifies where the object is placed in the environment (like we do here). In the real world we don't have a specific origin, but all objects are still in a common reference frame with relative displacements between each other.
The pose of all the features on the object (also unknown to Monty). For instance, if you use a 3D simulator, this could be defined in the object meshes.
The sensor’s location and orientation in the world can be estimated from proprioceptive information and motor efference copies. Basically, as the system moves its sensors, it can use this information to update the locations of those sensors in a common reference frame. For the purposes of Monty it doesn't matter where the origin of this RF is (it could be the agent's body or an arbitrary location in the environment) but it matters that all sensor locations are represented in this common RF. In simulation environments, this pose estimation can be perfect / error-free if desired, while in the real world you usually have noisy estimates.
This is the feature's orientation relative to the sensor. For instance in Monty, we often use the surface normal and curvature direction to define the sensed pose, which are extracted from the depth image of the sensor (code).
The estimated orientation of the sensed feature in the world (sensor_rel_world * feature_rel_sensor). In Monty, this transform is currently applied to the depth values in DepthTo3DLocations, while the surface normal and curvature extraction then happens in the SM; in the end you get the same result.
This is the hypothesized orientation of the learned model. It expresses how the incoming features and movements need to be transformed to be in the model's reference frame. This orientation needs to be inferred by the LM based on its sensory inputs. There are usually multiple hypotheses, which Monty learning modules keep track of in self.possible_poses.
The inverse of this pose (model rel. world) expresses how the model would need to be rotated to overlay correctly onto the object in the world. We use this inverse as Monty's pose classification to calculate its rotation error.
This is the rotation of the currently sensed feature relative to the object model in the LM (feature_rel_world * object_rel_model, see code here). The learning module uses its pose hypothesis to transform the feature relative to the world into its object's reference frame so that it can recognize the object in any orientation and location in the world, independent of where it was learned.
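The composition of these transforms can be illustrated with scipy rotations. The values below are arbitrary; only the naming mirrors the text above.

```python
from scipy.spatial.transform import Rotation as R

# Hypothetical example poses, expressed as rotations.
sensor_rel_world = R.from_euler("xyz", [0, 90, 0], degrees=True)
feature_rel_sensor = R.from_euler("xyz", [10, 0, 0], degrees=True)

# Sensed feature orientation in the world (computed in the SM / transforms).
feature_rel_world = sensor_rel_world * feature_rel_sensor

# One pose hypothesis for the learned model.
object_rel_model = R.from_euler("xyz", [0, 0, 45], degrees=True)

# Feature in the model's reference frame, used for matching.
feature_rel_model = feature_rel_world * object_rel_model

# The inverse is Monty's pose classification (model rel. world).
model_rel_world = object_rel_model.inv()
```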
Note
For a description of these transforms in mathematical notation, see our pre-print Thousand-Brains Systems: Sensorimotor Intelligence for Rapid, Robust Learning and Inference.
Note
In the brain we hypothesize that the transformation of sensory input into the object's reference frame is done by the thalamus (see our neuroscience theory paper for more details).
The transform in the sensor module combines the sensor pose in the world with the sensed pose of the features relative to the sensor. This way, if the sensor moves while fixating on a point on the object, that feature pose will not change (see animation below). We are sending the same location and orientation of the feature in the world to the LM, no matter from which angle the sensor is "looking" at it.
This is one of the key definitions of the CMP: The pose sent out from all SMs is in a common reference frame. In Monty, we use the DepthTo3DLocations transform for this calculation and report locations and orientations in an (arbitrary) world reference frame.
If the object changes its pose in the world, the pose hypothesis in the LM comes into play. The LM has a hypothesis on how the object is rotated in the world relative to its model of the object. This is essentially the rotation that needs to be applied to rotate the incoming features and movements into the model’s reference frame.
If the object is in a different rotation than the one it was learned in, all the features on the object will be sensed at different orientations and locations in the world. This needs to be compensated for by the yellow projection of the pose hypothesis. This way, the incoming features used for recognition (dark blue) remain constant as the object rotates.
The plot below only shows one pose hypothesis, but in practice, Monty has many of them, and it needs to infer this pose from what it is sensing (dark green orientations are unknown). The animation also assumes that Monty is correctly inferring and updating its hypothesis as the object is rotating in the world, which is not trivial.
What is not shown in the visualizations above is that the same rotation transform is also applied to the movement vector (displacement) of the sensor (pink). In Monty we calculate how much the sensor has moved in the world by taking the difference between two successive location and orientation inputs to the LM (code). The sensor movement is applied to update the hypothesized location on the object (pink dot).
Before we can apply the movement to update our hypothesized location on the object, the movement needs to be transformed from a movement in the world to movement in the object's reference frame. To do this, the hypothesized object orientation (yellow) is applied, the same way it is applied to the incoming features.
So, for example, if the object is rotated by 90 degrees to the left and the sensor moves from right to left on the object (like in the animation shown below), then the orientation transform will update the location on the object model (pink dot) to move from the bottom to the top of the object.
Applying the hypothesized object rotation to both the incoming movement and features means that Monty can recognize the object in any new location and orientation in the world.
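A small illustration of this step with made-up values: the sensor displacement in the world is rotated by the hypothesized object pose before it updates the hypothesized location on the model.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

# The sensor moved from right to left in the world (illustrative values).
previous_location = np.array([0.10, 0.0, 0.0])
current_location = np.array([0.05, 0.0, 0.0])
world_displacement = current_location - previous_location

# Hypothesis: the object is rotated 90 degrees vs. how it was learned.
pose_hypothesis = R.from_euler("z", 90, degrees=True)

# Rotate the movement into the model's reference frame, then update the
# hypothesized location on the object model (the pink dot in the figures).
model_displacement = pose_hypothesis.apply(world_displacement)
hypothesized_location = np.array([0.0, 0.02, 0.0]) + model_displacement
```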
Voting in Monty happens in model space. We directly transform from the model RF of the sending LM to the model RF of the receiving LM. This relies on the assumption that both LMs learned the object at the same time, and hence their reference frames line up in orientation and displacement (since we receive features rel. world, which automatically line up if the LMs learn the object at the same time). Otherwise, we could store one displacement between their reference frames and apply that as well.
The two LMs receive input from two sensors that sense different locations and orientations in space. They receive that as a pose in a common coordinate system (rel. world in the image below). Since the sensors are at different locations on the object and our hypotheses are “locations of sensor rel. model” we can’t just vote on the hypotheses directly, but have to account for the relative sensor displacement. For example, the sensor that is sensing the handle of the cup needs to incorporate the offset to the sensor that senses the rim to be able to use its hypotheses.
This offset can easily be calculated from the difference of the two LMs' inputs, as those poses are in a common coordinate system. The sending LM attaches its sensed pose in the world to the vote message it sends out (along with its hypotheses about the locations on the mug), and the receiving LM compares it with its own sensed pose (green arrow) and applies the displacement to the vote hypotheses (shifting the pink dots).
Check out our evidence LM documentation for more details on voting.
The simplest example is if both learning modules receive input from the same location in the world (i.e. the two sensors "look" at the same point). In this case, no transformation needs to be applied to the incoming votes (no transformation is represented as a green dot in the figure) and they can directly be overlaid onto the receiving LM's reference frame.

However, voting in this case does not help reduce ambiguity as both LMs receive the same input.
If the two LMs receive input from two different locations in the world (like in the animation above and the image below), we need to calculate the displacement between the two and apply it to the incoming votes. In the example below, the left LM is sensing the left side of the rim and the right LM is sensing the handle. When the right LM receives votes from the left one, it needs to shift those down and towards the right. You can think of the right LM saying "if you are over there, and you think you are at this location on the mug, then I should be x amount to the lower right of that".

The displaced votes are overlaid onto the existing location hypotheses of the receiving LM (purple dots). Only a few points will be consistent with the incoming votes (in this case just one, circled in dark green). When the right LM sends votes to the left one, its location hypotheses will have to be displaced in the opposite direction (towards the left).
What we haven't shown so far is how voting accounts for the pose hypothesis. The examples above all illustrated cases where the object orientation in the world matched the orientation it was learned in (hence the yellow arrow did not rotate any of the input). The figures above were also simplified in that they only showed one pose hypothesis for the object. In reality, we have one pose hypothesis (yellow arrows) for each hypothesized location (pink dots).
When voting, the sending LM sends not only its location hypotheses but also the hypothesized pose of the model, with one such pose for each location hypothesis sent. The receiving LM then uses these poses to rotate the sensor displacement (thick green arrow). This is similar to how sensor movement in the world needs to be rotated into the model's reference frame.
After this displacement is applied, the vote locations (pink dots) can again be overlaid onto the existing hypothesis space (purple dots).
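Summarizing the voting transform, here is an illustrative sketch (not the actual Monty implementation) of how incoming vote locations could be displaced using the two LMs' sensed poses and the sender's pose hypotheses:

```python
import numpy as np

def displace_incoming_votes(vote_locations, vote_pose_hypotheses,
                            sender_sensed_location, receiver_sensed_location):
    """Shift incoming vote locations into the receiving LM's model (sketch).

    vote_pose_hypotheses: one rotation (e.g., a scipy Rotation) per vote
    location. The sensor displacement (in the world) is rotated by each vote's
    pose hypothesis before being applied, mirroring how sensor movement is
    rotated into the model's reference frame.
    """
    world_displacement = receiver_sensed_location - sender_sensed_location
    displaced = [loc + pose.apply(world_displacement)
                 for loc, pose in zip(vote_locations, vote_pose_hypotheses)]
    return np.array(displaced)
```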
In a hierarchy of Learning Modules (LMs), the higher-level LM receives the output of the lower-level LM. This output represents features at poses, the same as the output of sensor modules. The features that an LM outputs are descriptors of the model it recognized internally, for instance, an object ID and its pose. It is advantageous if the object IDs are descriptive enough that similarity metrics can be applied. When similarity is incorporated into the object representations, similar objects can be treated similarly, allowing the higher-level LM to generalize to variability in object structures. For example, a car should be recognized regardless of the brand of tires that are on it. This is analogous to an LM recognizing slightly different shades of blue at the same location on an object.
We choose Sparse Distributed Representations (SDR) to represent objects and use overlap as a similarity metric (i.e., higher overlap of bits translates to higher similarity of objects). As a brief reminder, SDRs are high-dimensional, binary vectors with significantly more 0 than 1 bits. The choice of SDRs as a representation for objects allows us to benefit from their useful properties, such as high representational capacity, robustness to noise, and ability to express unions of representations.
The problem of encoding object representations into SDRs is twofold:
- Estimating the similarity between stored graphs.
- Creating SDR representations with overlaps matching the desired similarity.
Ideally, objects that share morphological features (e.g., cylinder-like) or non-morphological features (e.g., red) should be represented with high bit overlaps in their SDR representations, as shown in the figure below. Estimating the similarity between 3D graphs is not a trivial problem; however, the notion of evidence can help us rank objects by their similarity with respect to the most likely object of an episode.
Consider a learning module actively sensing a single object environment (e.g., a fork). The EvidenceGraphLM initializes a hypothesis space of possible locations and poses on all graphs in its memory (e.g., fork, knife, and mug). With every observation, the LM accumulates evidence on each valid hypothesis across all objects. In this case, the fork and knife will accumulate relatively more evidence than the mug, as shown in the figure below (left). Therefore, we use the difference between the evidence scores of the best hypothesis for each object (i.e., relative evidence) to represent similarity (i.e., comparable evidence means high similarity, as shown in the figure below).
These relative evidence scores are then linearly mapped to the desired bit overlap range (e.g., [-8,0] -> [0,41] where 41 would be the number of on bits in each SDR and, therefore, the maximum overlap). Each episode populates a single row of the target overlaps matrix because it estimates the pairwise similarity of all objects with respect to the sensed object (i.e., most-likely object).
It is important to note that since the hypothesis space already considers all possible locations and orientations of the objects, we do not have to align (i.e., translate or rotate) the graphs for comparison; this is already performed while initializing the hypotheses space. For simplicity, we only show a single best hypothesis on the objects in the figure above, but matching considers all existing hypotheses. Similarity estimation is a byproduct of evidence matching and comes at virtually no extra computational cost.
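As an illustration of the linear mapping from relative evidence to target overlaps (the [-8, 0] -> [0, 41] example above), assuming the relative scores are at most 0 and at least one is negative:

```python
import numpy as np

def evidence_to_target_overlaps(relative_evidence, n_on_bits=41):
    """Linearly map relative evidence scores to target bit overlaps (sketch).

    relative_evidence: per-object difference to the most likely object's best
    hypothesis (<= 0, with 0 for the sensed object itself).
    """
    lo = relative_evidence.min()
    # lo maps to 0 overlap; 0 maps to n_on_bits (an SDR's overlap with itself).
    return np.round((relative_evidence - lo) / -lo * n_on_bits).astype(int)

# Evidence differences of [-8, -1, 0] map to overlaps of [0, 36, 41].
print(evidence_to_target_overlaps(np.array([-8.0, -1.0, 0.0])))
```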
We now take the matrix of target overlaps ([0, 41] in the figure above, [0, 3] in the figure below) and generate SDRs from it. The highest amount of overlap is represented by the values on the diagonal (i.e., the overlap of an SDR with itself) and is always the same as the number of active bits in the SDR, defining its sparsity. We impose these pairwise target overlap scores as a constraint while creating SDR representations. This problem can be thought of as an optimization problem with the objective of minimizing the difference between the actual SDR overlap and this matrix of target overlaps.
Consider a set of objects and their pairwise target overlap (extracted from evidence scores); the goal is to create SDRs (one for each object) with pairwise overlaps matching the target overlaps. One seemingly simple solution is to directly optimize the object SDR representations to the desired bit overlaps. However, optimizing binary vectors directly is challenging because binary values are not differentiable. In what follows, we describe a simple encoding mechanism and provide a step-by-step toy example for creating and optimizing these SDRs from real-valued representations. It is important to note that while the following approach uses gradient descent, it should not be viewed as an instance of deep learning, as there are no perceptrons, non-linear activation functions, or hidden weights.
The goal is to create object SDRs with target overlaps; therefore, we first randomly initialize real-valued object representations. Each dense vector represents a single object and has the size of the desired SDR length. We also define a Top-K readout function, which converts the dense representation to SDR with the desired sparsity (i.e., K is the number of active bits in each SDR). In this toy example, we use a size of 10 total bits for each SDR with 3 active bits (i.e., 70% sparsity). Note that these are not the typical values we use in SDRs but are only used here for visualization purposes. More typical values would be 41 bits on out of 2048 possible (98% sparsity). Three objects are initialized, as shown in the figure below.
In this step, we quantify the error in overlap bits between the randomly initialized representations and the target overlaps. The error is calculated from the difference between the current and target overlaps. The error matrix is the same size as the overlap matrices, with values indicating whether two representations need to become more or less similar (i.e., the sign of the error) and by how much (i.e., the magnitude of the error).
The error in overlap bits cannot be used to optimize the object SDRs directly since the top K function is not differentiable. We, therefore, apply it as a weight on the pairwise distance of the dense representations. The pairwise distance is calculated using the L2 distance function between each pair of dense vectors. Note that these vectors are initially random and are not derived from evidence values. If the overlap error indicates that two objects should be more similar, we minimize the distance between these two objects in Euclidean space and vice versa. This indirectly minimizes the overall error in pairwise overlap bits. Note that gradients only flow through the dense similarity calculations and are not allowed to flow through the Top-K readout function as indicated in the overview figure by Stop Gradient.
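The toy sketch below mirrors this setup using PyTorch for the gradient step: the Top-K readout runs outside the autograd graph (the "Stop Gradient"), and the overlap error weights the differentiable pairwise distances. It is a simplified illustration, not the EvidenceSDRLMMixin implementation; the target overlaps are made up.

```python
import torch

n_objects, sdr_size, k = 3, 10, 3
dense = torch.randn(n_objects, sdr_size, requires_grad=True)
# Hypothetical target overlaps; the diagonal equals k (an SDR with itself).
target_overlap = torch.tensor([[3.0, 2.0, 0.0],
                               [2.0, 3.0, 1.0],
                               [0.0, 1.0, 3.0]])
optimizer = torch.optim.SGD([dense], lr=0.01)

def topk_readout(x):
    """Non-differentiable readout from dense vectors to SDRs ("Stop Gradient")."""
    sdr = torch.zeros_like(x)
    sdr.scatter_(1, x.topk(k, dim=1).indices, 1.0)
    return sdr

for _ in range(500):
    with torch.no_grad():  # the overlap error is a constant at each step
        sdrs = topk_readout(dense)
        error = target_overlap - sdrs @ sdrs.T  # + means "should be more similar"
    dists = torch.cdist(dense, dense)  # differentiable pairwise L2 distances
    loss = (error * dists).sum()       # shrink distances where error > 0
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```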
The encoding algorithm is implemented as an LM Mixin by the name EvidenceSDRLMMixin. This Mixin overrides the initialization of the LM, and the post_episode function to collect evidence values after each episode and use them to optimize a set of object representations. It configures the evidence updater with the SDRFeatureEvidenceCalculator and the AllFeaturesForMatchingChecker. We define the EvidenceSDRGraphLM, which incorporates the Mixin and can be used as a drop-in replacement for EvidenceGraphLM when defining an experiment config.
Want to dive deeper into this topic? Have a look at this presentation: 2024/08 - Encoding Object Similarity in SDRs (https://www.youtube.com/watch?v=qHdl8tlzY3k)
title: CMP & Hierarchy Improvements
description: Improvements we would like to add to the CMP or hierarchical information processing.
These are the things we would like to implement:
- Figure out performance measures and supervision in heterarchy #infrastructure #compositional
- Add top-down connections #numsteps #multiobj #compositional
- Run & Analyze experiments with >2LMs in heterarchy testbed #compositional
- Run & Analyze experiments in multi-object environment looking at scene graphs #multiobj
- Test learning at different speeds depending on level in hierarchy #learning #generalization
- Send similarity encoding object ID to next level & test #compositional
- Global Interval Timer #dynamic
- Include State in CMP #dynamic
!snippet[../snippets/contributing-tasks.md]
📘 Heterarchy vs. Hierarchy
We sometimes use the term heterarchy to express the notion that, similar to the brain, connections in Monty aren't always strictly hierarchical. There are many long-range connections (voting) within the same hierarchical level and across levels, as well as skip connections. Also, every level in the "hierarchy" has a motor output (goal state) that it sends to the motor system.
title: Environment Improvements
description: New environments and benchmark experiments we would like to add.
These are the things we would like to implement:
- Make dataset to test compositional objects #compositional #multiobj
- Object behavior test bed #dynamic
- Set up Environment that allows for object manipulation #goalpolicy
- Set up object manipulation benchmark tasks and evaluation measures #goalpolicy
- Create dataset and metrics to evaluate categories and generalization #generalization
- Create dataset and metrics to test new feature-morphology pairs #featsandmorph
!snippet[../snippets/contributing-tasks.md]
title: Framework Improvements
description: Improvements we would like to make on the general code framework.
These are the things we would like to implement:
- Add infrastructure for multiple agents that move independently #numsteps #infrastructure #goalpolicy
- Automate benchmark experiments & analysis #infrastructure
- Add more wandb logging for learning from unsupervised #learning
- Add GPU support for Monty #speed
- Use State class inside of LMs #infrastructure
- Make configs easier to use #infrastructure
- Find faster alternative to KDTree search #speed
!snippet[../snippets/contributing-tasks.md]
title: Learning Module Improvements
description: Improvements we would like to add to the learning modules.
We have a guide on customizing learning modules here.
These are the things we would like to implement:
- Use off-object observations #numsteps #multiobj
- Implement and test rapid evidence decay as form of unsupervised memory resetting #multiobj
- Improve bounded evidence performance #multiobj
- Use models with fewer points #speed #generalization
- Make it possible to store multiple feature maps on one graph #featsandmorph
- Test particle-filter-like resampling of hypothesis space #accuracy #speed
- Re-anchor hypotheses for robustness to noise and distortions #deformations #noise #generalization
- Less dependency on first observation #noise #multiobj
- Deal with incomplete models #learning
- Implement & test GNNs to model object behaviors & states #dynamic
- Deal with moving objects #dynamic #realworld
- Support scale invariance #scale
- Improve handling of symmetry #pose
- Use Better Priors for Hypothesis Initialization #numsteps #pose #scale
- Include State in Models #dynamic
- Include State in Hypotheses #dynamic
- Event Detection to Reset Timer #dynamic
- Speed Detection to Adjust Timer #dynamic
!snippet[../snippets/contributing-tasks.md]
title: Motor System Improvements
description: Improvements we would like to add to the motor system.
These are the things we would like to implement:
- Interpret goal states in motor system & switch policies #goalpolicy
- Implement switching between learning and inference-focused policies #learning
- Bottom-up exploration policy for surface agent #learning
- Model-based exploration policy #learning #numsteps
- Implement efficient saccades driven by model-free and model-based signals #numsteps #multiobj
- Learn policy using RL and simplified action space #numsteps #speed
- Decompose goals into subgoals & communicate #goalpolicy
- Reuse hypothesis testing policy target points #goalpolicy #numsteps
- Implement a simple cross-modal policy #learning #multiobj #goalpolicy #numsteps
- Model-based policy to recognize an object before moving onto a new object #multiobj #compositional
- Policy to quickly move to a new object #speed #multiobj #compositional
!snippet[../snippets/contributing-tasks.md]
title: OSS/Communication Improvements
description: Things we would like to do for the open-source community or general communication.
These are the things we would like to do:
!snippet[../snippets/contributing-tasks.md]
title: Project Roadmap
description: An overview of tasks we plan to work on or would welcome contributions on.
We have a high-level overview table of tasks on our roadmap in our project planning spreadsheet.
These are tasks where we have a rough idea of how we want to achieve them but we haven't necessarily scheduled when we will work on them or who will work on them. We welcome anyone who would like to pick up one of these tasks and contribute to it.
Tasks are categorized in two ways:
- Which part of Monty does it touch? This is represented in the columns, and is organized into Sensor Module, Learning Module, Motor System, Voting, Hierarchy, Environment, Framework, and Open-Source/Communication.
- Which capability (or capabilities) does it improve? This is represented by hashtags (#) at the end of the task description. We keep track of the progress on the capabilities in the column on the right. Each task should improve at least one capability but can contribute to multiple.
Tasks that are done have a check mark next to them and are shaded in green. When a task gets checked off, it will add progress to the corresponding capabilities on the right.
Some of the tasks are under active development by our team or scheduled to be tackled by us soon. Those are shaded in color. Below the main table, you can find a list of our past and current milestones with more detailed descriptions, timelines, and who is working on them. The colors of the milestones correspond to the colors in the main table.
Tasks that are actively worked on have a little player icon on them. Each member of our TBP team has an icon, and we drag it to the task they are currently working on. This way, we can see at a glance which parts and capabilities are currently being worked on.
We also have two stars that mark the current top two priorities of the TBP team.
To get a more concrete idea of our goals and roadmap for Q2 - 2025, have a look at this video: 2025/04 - Q2 Roadmap (https://www.youtube.com/watch?v=4by5MeJ1IT8)
We are happy if one of the tasks on this list piques your interest and you would like to contribute by tackling it!
We usually have some rough ideas of how we want to implement these tasks. These are usually written out in the corresponding sections under the "Future Work" category (sorted by Monty components, corresponding to the columns in the table above). Please have a read of the task outlines there before starting to work on one of them. If an item on our roadmap interests you but we don't have any details written out yet, please feel free to contact us and we can update the documentation.
Once you have a concrete idea of how you would like to tackle a task (whether it is similar to how we outlined it or different), the first step is to open an RFC describing your planned approach.
Once your RFC is merged and active, we will add a player icon for you on the table (using your GitHub or Discourse profile picture). This way we can keep an overview of which parts of the code people are currently working on. If you decide not to pursue the implementation further, please notify us so that someone else can pick up the task.
📘 More Ways to Contribute
Besides these more involved and research/code heavy tasks on our roadmap, there are many other ways you can get involved and support this project. Check out our page on ways to contribute to find out what best matches your skillset and interests.
Obviously, this table doesn't cover everything we want the system to eventually be able to do, or all the ideas you might come up with.
This table only contains tasks where we already have a more concrete idea of how we want to implement them. There are many topics where we are still actively brainstorming for solutions. We have a collection of open questions here. And if you are interested in our thoughts on them, the best place is to watch our meeting recordings on our YouTube channel.
We also welcome contributions that are not related to any of these tasks. Please remember to fill out an RFC first before working on a larger change. For a list of ways you can contribute besides code, see this guide.
title: Sensor Module Improvements
description: Improvements we would like to add to the sensor modules.
These are the things we would like to implement:
- Extract better features (predefined, classic CV, ANNs, contrastive learning,...) #noise #accuracy #numsteps
- Detect local and global flow #realworld #dynamic
- Change detecting SM #dynamic
!snippet[../snippets/contributing-tasks.md]
title: Voting Improvements
These are the things we would like to implement:
- Use pose for voting #numsteps #pose
- Outline routing protocol/attention #speed #multiobj
- Generalize voting to associative connections (arbitrary object ID and reference frame pairs) #infrastructure
- Can we change the CMP to use displacements instead of locations?
- Vote on State #dynamic
!snippet[../snippets/contributing-tasks.md]
In Monty systems, low-level LMs project to high-level LMs, where this projection occurs if their sensory receptive fields are co-aligned. Hierarchical connections should be able to learn a mapping between objects represented at these low-level LMs, and objects represented in the high-level LMs that frequently co-occur.
For example, a high-level LM of a dinner-set might have learned that the fork is present at a particular location in its internal reference frame. When at that location, it would therefore predict that the low-level LM should be sensing a fork, enabling the perception of a fork in the low-level LM even when there is a degree of noise or other source of uncertainty in the low-level LM's representation.
In the brain, these top-down projections correspond to L6 to L1 connections, where the synapses at L1 would support predictions about object ID. However, these projections also form local synapses en route through the L6 layer of the lower-level cortical column. In a Monty LM, this would correspond to the top-down connection predicting not just the object that the low-level LM should be sensing, but also the specific location at which it should be sensing it. This could be complemented with predicting a particular pose of the low-level object (see Use Better Priors for Hypothesis Initialization).
This location-specific association in both models is key to how we believe compositional objects are represented. For example, if you had a coffee mug with a logo on it, that logo might make an (unusual) 90-degree bend half-way along its length. This could be learned by associating the logo with the mug multiple times, where different locations in logo space, as well as different poses of the logo, would be associated with different locations on the mug's surface.
As we introduce hierarchy and compositional objects, such as a mug with a logo on it, we need to figure out both how to measure the performance of the system, and how to supervise the learning. For the latter, we might choose to train the system on component objects in isolation (various logos, mugs, bowls, etc.) before then showing Monty the full compositional object (a mug with a logo on it). When evaluating performance, we might then see how well the system retrieves representations at different levels of the hierarchy. However, in the more core setting of unsupervised learning, representations of the sub-objects would likely also emerge at the high level (a coarse logo representation, etc.), while we may also find some representations of the mug in low-level LMs. Deciding then how we measure performance will be more difficult.
When we move to objects with less obvious composition (i.e. where the sub-objects must be disentangled in a fully unsupervised manner), representations will emerge at different levels of the system that may not correspond to any labels present in our datasets. For example, handles, or the head of a spoon, may emerge as object-representations in low-level LMs, even though the dataset only recognizes labels like "mug" and "spoon".
This is less clear, but one approach to measure the "correctness" of representations in this setting might be how well a predicted representation aligns with the outside world. For example, while LMs are not designed to be used as generative models, we could visualize how well an inferred object graph maps onto the object actually present in the world. Quantifying such alignment might leverage measures such as differences in point-clouds. This would provide some evidence of how well the learned decomposition of objects corresponds to the actual objects present in the world.
See also Make Dataset to Test Compositional Objects and Metrics to Evaluate Categories and Generalization.
title: Global Interval Timer
description: Add a (semi) global interval timer that provides information about time elapsed since the last event to LMs.
This item relates to the broader goal of modeling object behaviors in Monty.
We want to have an interval timer that exists separately from the current Monty components and provides input to a large group of LMs (or all).
The interval timer is a bit like a stopwatch that counts up time from a previous event and broadcasts this elapsed time widely. The time can be reset by significant events that are detected in LMs (or SMs).
In the brain, this elapsed time might be encoded with time cells (e.g. see Kraus, 2013) or a similar mechanism, where different cells fire at different temporal delays from the start of an interval. Together they tile the temporal space (up to a certain max duration and with higher resolution for shorter durations). This means that the LM receives different neural input from the timer, depending on how much time has passed since the start of this state.
In the conceptual figure below, this mechanism is illustrated as the circle with little purple dots on it. Each significant event resets the circle to have the top neuron (darkest purple) active. Then, as time passes, the neurons on the circle become active in a clockwise direction. The ID of the active time neuron is the input to L1 of the cortical column.
There is a default speed at which the timer goes through the different active states. This is the timing information stored during learning. During inference, however, the column can tell the interval timer to spin faster or slower, which means that it will receive the same neural input to L1 at different absolute times from the last event. Note that there is no continuous "time-since" signal but instead there are discrete representations for different elapsed intervals and hence, these representations can change faster or slower and be associated with features in the models.
The animation below shows an example of learning a melody. Here, each note is a significant event that resets the timer.
Whenever an input feature (here, the note, although one could also use a background beat instead) comes into the column/LM, it gets associated with the L1 input that is received at that time, which represents the time passed since the last event.
Whenever the timer is reset, we move forward by one element in the sequence (here mapped onto L5b, though that is speculative). Since we only move through the sequence in one direction, no path integration is needed.
In the continuation of this example below, you can see that if there is a longer interval, the clock will spin further through the different neural representations and the next feature will be associated with a later one (lighter purple).
Next, let's look at an example of an object behavior. Here we have both discrete events (when the stapler starts moving and when it reaches its lowest position and staples) and continuous changes between events. This could use the above mechanism as follows:
The clock is reset and starts ticking through the different neural activations as before. However, as we activate different time cells, we receive different inputs in L4 and associate these different inputs with the different time cells. This allows us to represent feature inputs at different offsets after the last significant event. When a significant event occurs (i.e., the stapler reaches the bottom position and staples some paper), the clock is reset and we move forward one element in the sequence (L5b).
One other change here is that we are sensing in 3D space (2D in the example diagram), so in addition to our position in the sequence, we have a location on the object (L6a, shown as a grid here). This location is updated using sensor movement (our existing mechanism). In the example here, the sensor doesn't move, so we just store sequence information at that location, but if we explore a behavior over and over again at different locations, this would become a richer representation of the entire object behavior.
The learned model would then look as shown below. We have discrete states in a sequence (movement through the sequence is determined by resets of the global clock). Each state can store features at different locations at different intervals after the last event (colored arrows).
For inference, we would need to infer the location in the object’s reference frame, as usual. In addition, we need to infer the location in the sequence (in the sequence of discrete states). The interval within each state does not need to be inferred, as it is provided by the global clock. (see page on including state in hypotheses)
However, if there is a mismatch between which feature is sensed and the expected feature based on the input into L1, this can be used as a signal to speed up or slow down the global clock (see page on speed detection to adjust timer).
Higher resolution and lower matching tolerance for short durations compared to long ones could be implemented by tiling the short duration more densely (i.e. having more distinct neural representations as input to the LM for those).
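To make the mechanism concrete, here is a conceptual sketch of such a timer. Everything about it (the names, the logarithmic tiling, the speed factor) is illustrative rather than an existing Monty component.

```python
import numpy as np

class IntervalTimer:
    """Conceptual global interval timer with time-cell-like discrete output."""

    def __init__(self, n_cells=16, max_duration=10.0, speed_factor=1.0):
        # Logarithmically spaced boundaries tile short durations more densely,
        # giving higher temporal resolution right after an event.
        self.boundaries = np.logspace(-2, np.log10(max_duration), n_cells)
        self.speed_factor = speed_factor  # LMs can ask the timer to spin faster/slower
        self.elapsed = 0.0

    def reset(self):
        """Called when an LM (or SM) detects a significant event."""
        self.elapsed = 0.0

    def tick(self, dt):
        self.elapsed += dt * self.speed_factor

    def active_cell(self):
        """ID of the active 'time cell'; this is what is broadcast to the LMs."""
        cell = np.searchsorted(self.boundaries, self.elapsed)
        return int(min(cell, len(self.boundaries) - 1))
```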
title: Include State in CMP
description: Include the inferred state of an object as part of the CMP message.
This item relates to the broader goal of modeling object behaviors in Monty.
To infer object or behavioral state efficiently and communicate it between LMs, we need to add it to the CMP messages. This can be done by updating the CMP definition (currently represented in the unfortunately named State class whose name we should probably change in the scope of this).
We have implemented the ability to encode object IDs using sparse-distributed representations (SDRs), and in particular can use this as a way of capturing similarity and dissimilarity between objects. Using such encodings in learned Hierarchical Connections, we should observe a degree of natural generalization when recognizing compositional objects.
For example, assume a Monty system learns a dinner table setting with normal cutlery and plates (see examples below). Separately, the system learns about medieval instances of cutlery and plates, but never sees them arranged in a dinner table setting. Based on the similarity of the medieval cutlery objects to their modern counterparts, the objects should have considerable overlap in their SDR encodings.
If the system was to then see a medieval dinner table setting for the first time, it should be able to recognize the arrangement as a dinner-table setting with reasonable confidence, even if the constituent objects are somewhat different from those present when the compositional object was first learned.
We should note that we are still determining whether overlapping bits between SDRs is the best way to encode object similarity. As such, we are also open to exploring this task with alternative approaches, such as directly making use of values in the evidence-similarity matrix (from which SDRs are currently derived).
Example of a standard dinner table setting with modern cutlery and plates that the system could learn from.
Example of a medieval dinner table setting with medieval cutlery and plates that the system could be evaluated on, after having observed the individual objects in isolation.
Our general view is that episodic memory and working memory in the brain leverage similar representations to those in learning modules, i.e. structured reference frames of discrete objects.
For example, the brain has a specialized region for episodic memory (the hippocampal complex), due to the large number of synapses required to rapidly form novel binding associations. However, we believe the core algorithms of the hippocampal complex follow the same principles of a cortical column (and therefore a learning module), with learning simply occurring on a faster time scale.
As such, we would like to explore adding forms of episodic and working memory by introducing high-level learning modules that learn information on extremely fast time scales relative to lower-level LMs. These should be particularly valuable in settings such as recognizing multi-object arrangements in a scene, and providing memory when a Monty system is performing a multi-step task. Note that because of the overlap in the core algorithms, LMs can be used largely as-is for these memory systems, with the only change being the learning rate.
It is worth noting that the GridObjectModel would be particularly well suited for introducing a learning-rate parameter, due to its constraints on the amount of information that can be stored.
As a final note, varying the learning rate across learning modules will likely play an important role in dealing with representational drift, and the impact it can have on continual learning. For example, we expect that low-level LMs, which partly form the representations in higher-level LMs, will change their representations more slowly.
For the distant agent, we have a policy specifically tailored to learning, the naive scan policy, which systematically explores the visible surface of an object. We would like a similar policy for the surface agent that systematically spirals or scans across the surface of an object, at least in a local area.
This would likely be complemented by Model-Based Exploration Policies.
This will be most relevant when we begin implementing policies that change the state of the world, rather than just those that support efficient sensing and inference.
One example task we imagine is setting a dinner table. At the higher level of the system, a learning module that models dinner tables would receive the goal-state to have the table in the "set for eating" state. This might be a vision-based LM that can use its direct motor-output to saccade around the scene and infer whether the table is set, but cannot actually move objects.
It might perceive, for example, that the fork is not in the correct location for the learned model of a set dinner table. As such, it could pass a goal-state to the LM that models forks to be in the required pose in body-centric coordinates. The fork modeling-LM, which has information about the morphology of the fork, could then send goal-states directly to the motor-system, or to an LM that controls actuators like a hand. In either case, the ultimate aim is to apply pressure to the fork such that it achieves the desired goal state of being in the correct location.
To set the entire dinner table, the higher-level LM would send out the sub-goal of the fork in the correct position, before moving on to other components of the table object, such as setting the position of the knife.
In the above example, neither the dinner-table, fork, nor hand LMs have sufficient knowledge to complete the task on their own. Instead, it must be decomposed into a series of sub-goal states.
How exactly we define the goal-states that carry out the practical process of applying pressure to move the fork is still a point of discussion, so an early implementation might assume that a sub-cortical policy is already known that can move objects around the scene based on a received goal-state. Alternatively, we might begin with a simpler task such as pressing a button or key, where the motor policy simply needs to apply force at a specific location.
Actually learning the causal relationships between states in low-level objects and high-level objects is also an aspect we are still developing ideas for. However, we know that these will be formed via hierarchical connections between LMs, similar to the Top Down Connections Used for Sensory Prediction.
Once we have infrastructure support for multiple agents that move independently (see Add Infrastructure for Multiple Agents that Move Independently), we would like to implement a simple cross-modal policy for sensory guidance.
In particular, we can imagine a distant-agent rapidly saccading across a scene, observing objects of interest (see also Implement Efficient Saccades). When an object is observed, the LM associated with the distant-agent could send a goal-state (either directly or via an actuator-modeling LM) that results in the surface agent moving to that object and then beginning to explore it in detail.
Such a task would be relatively simple, while serving as a verification of a variety of components in the Cortical Messaging Protocol, such as:
- Recruiting agents that are not directly associated with the current LM, using goal-states (e.g. here we are recruiting the surface agent, rather than the distant agent).
- Coordination of multiple agents (the surface agent and distant agent might each inform areas of interest for the other to explore).
- Multi-modal voting (due to limited policies, voting has so far been limited to within-modality settings, although it supports cross-modal communication).
Currently the main way that the distant agent moves is by performing small, random, saccade-like movements. In addition, the entire agent can teleport to a received goal-state in order to e.g. test a hypothesis. We would like to implement the ability to perform larger saccades that are driven by both model-free and model-based signals, depending on the situation.
In the model-free case, salient information available in the view-finder could drive the agent to saccade to a particular location. This could rely on a variety of computer-vision methods to extract a coarse saliency map of the scene. This is analogous to the sub-cortical processing performed by the superior colliculus (see e.g. Basso and May, 2017).
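As a rough illustration of the model-free case, the sketch below computes a coarse saliency map from a view-finder image using local gradient energy. The function name and pooling grid are our own placeholders; any of the computer-vision methods mentioned above could be substituted.

```python
import numpy as np

def coarse_saliency_map(rgb_patch: np.ndarray, grid: int = 8) -> np.ndarray:
    """Pool local luminance-gradient energy into a coarse saliency map.

    `rgb_patch` is an (H, W, 3) image with values in [0, 1]. This is a
    minimal stand-in for the sub-cortical saliency computation described
    above, not an implemented Monty component.
    """
    gray = rgb_patch.mean(axis=-1)
    gy, gx = np.gradient(gray)
    energy = np.sqrt(gx**2 + gy**2)
    # Average the gradient energy within each cell of a grid x grid layout.
    h, w = energy.shape
    cropped = energy[: h - h % grid, : w - w % grid]
    pooled = cropped.reshape(grid, h // grid, grid, w // grid).mean(axis=(1, 3))
    return pooled / (pooled.max() + 1e-12)
```

The cell with the highest value could then be converted into a saccade target for the distant agent.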
In the model-based case, two primary settings should be considered:
- A single LM has determined that the agent should move to a particular location in order to test a hypothesis, and it sends a goal-state that can be satisfied with a saccade, rather than the entire agent jumping/teleporting to a new location. For example, saccading to where the handle of a mug is believed to be will refute or confirm the current hypothesis. This is the more important/immediate use case.
- Multiple LMs are present, including a smaller subset of more peripheral LMs. If one of these peripheral LMs observes something of interest, it can direct a goal-state to the motor system to perform a saccade such that a dense subset of LMs is able to visualize the object. This is analogous to cortical feedback bringing the fovea to an area of interest.
Such policies are particularly important in an unsupervised setting, where we will want to more efficiently explore objects in order to rapidly determine their identity, given we have no supervised signal to tell us whether this is a familiar object, or an entirely new one. This will be compounded by the fact that evidence for objects will rapidly decay in order to better support the unsupervised setting.
Unlike the hypothesis-testing policy, we would specifically like to implement these as a saccade policy for the distant agent (i.e. no translation of the agent, only rotation), as this is a step towards an agent that can sample efficiently in the real world without having to physically translate sensors through space. Ideally, this would be implemented via a unified mechanism of specifying a location in e.g. ego-centric 3D space, and the policy determining the necessary rotation to focus at that point in space.
Currently, a Monty system cannot flexibly switch between a learning-focused policy (such as the naive scan policy) and an inference-focused policy. Enabling LMs to guide such a switch based on their internal models, and whether they are in a matching or exploration state, would be a useful improvement.
This would be a specific example of a more general mechanism for switching between different policies, as discussed in Switching Policies via Goal States.
Similarly, an LM should be able to determine the most appropriate model-based policies to initialize, such as the hypothesis-testing policy vs. a model-based exploration policy.
We would like to implement a state-switching mechanism where an LM (or multiple LMs) can pass a goal-state to the motor system to switch the model-free policies that it is executing.
For example, we might like to perform a thorough, random walk in a small region if the observations are noisy and we would like to sample them densely. Alternatively, we might like to move quickly across the surface of an object, spending little time in a given region.
This task also relates to Enable Switching Between Learning and Inference-Focused Policies.
Learning policies through rewards will become important when we begin implementing complex policies that change the state of the world. However, these could also be relevant for inference and learning, for example by learning when to switch policies instead of adhering to a single heuristic like in the curvature-following policy.
In general, we envision that we would use slow, deliberate model-based policies to perform a complex new task, such as one that involves coordinating multiple actuators. Initially, the action would always be performed in this slow, model-based manner. However, with each execution of the task, these sequences of movements provide samples for training a model-free policy to efficiently coordinate relevant actuators in parallel, and without the expensive sampling cost of model-based policies.
For example, learning to oppose the finger and thumb in order to make a pinch grasp might initially involve moving one digit until it meets the surface of the object or the other digit, and then applying force with the other. Over time, a model-free policy could learn to move both digits together, with this "pinch policy" recruited by top-down goal-states as necessary.
In addition to supporting efficient, parallel execution of actions, learned model-free policies will be important for more refined movements. For example, the movement required to press a very small button or balance an object might be coarsely guided by a model-based policy, but the fine motor control required to do so would be adjusted via a model-free policy.
During exploration/learning-focused movement, we do not make use of any model-based, top-down policies driven by LMs. Two approaches we would like to implement are:
- A model-based policy that moves the sensors to areas that potentially represent the explored limits of an object. For example, if we've explored part of an object's surface but not the entirety of it, then there will be points on the edge of the learned model with few neighboring points. Exploring in these regions is likely to efficiently uncover novel observations (see the sketch after this list). Note that a "false-positive" for this heuristic is that thin objects like a wire or piece of paper will naturally have such regions at their edges, so it should only represent a bias in exploration, not a hard rule.
- A model-based policy that spends more time exploring regions associated with high-frequency feature changes, or discriminative object-parts. For example, the spoon and fork in YCB are easy to confuse if the models of their heads are not sufficiently detailed. Two heuristics to support greater exploration in this area could include:
- High-frequency changes in low-level features means we need a more detailed model of that part of the object. For example, the surface normals change frequently at the head of the fork, and so we likely need to explore it in more detail to develop a sufficiently descriptive model. The feature-change sensor-module is helpful for ensuring these observations are processed by the learning-module, but a modified policy would actually encourage more exploration in these regions.
- Locations that are useful for distinguishing objects, such as the fork vs. spoon heads, are worth knowing in detail, because they define the differences between similar objects. These points correspond to those that are frequently tested by the hypothesis-testing policy (see Reuse Hypothesis-Testing Policy Target Points), and such stored locations can be leveraged to guide exploration.
- As we introduce hierarchy, it may be possible to unify these concepts under a single policy, i.e. where frequently changing features can either be at the sensory (low-level) input, or at the more abstract level of incoming sub-objects.
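The first heuristic above could be sketched as follows, assuming the learned model is available as an (N, 3) array of point locations. All names are illustrative, and sampling targets (rather than always picking the sparsest point) keeps the heuristic a bias rather than a hard rule.

```python
import numpy as np
from scipy.spatial import cKDTree

def sample_exploration_targets(model_points, radius, n_targets=3, rng=None):
    """Sample exploration targets biased toward the explored limits of a model.

    Points with few neighbors within `radius` likely sit at the edge of
    what has been explored; they are sampled with probability inversely
    proportional to their neighbor count.
    """
    tree = cKDTree(model_points)
    neighbor_counts = np.array(
        [len(nbrs) for nbrs in tree.query_ball_point(model_points, radius)]
    )
    weights = 1.0 / neighbor_counts  # each point counts itself, so counts >= 1
    probabilities = weights / weights.sum()
    rng = rng or np.random.default_rng()
    chosen = rng.choice(
        len(model_points), size=n_targets, replace=False, p=probabilities
    )
    return model_points[chosen]
```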
Model-based exploration policies will be particularly important in an unsupervised setting. In particular, Monty will need to efficiently explore objects in order to recognize the commonality between different perspectives from which they are observed. For example, after learning a coffee mug from one view, and then seeing it from a different angle, Monty can move around to see that these are two views of the same object, and combine these into a single model. However, this recognition requires some overlap between the parts of the mug that were seen during the first exposure and what is observed on the second exposure. Learning the full structure of an object through active exploration should be easier than relying on separate, idealized views of the object (e.g. 14 rotations corresponding to the 6 faces and 8 corners of a cube), where the overlap between view points may be minimal.
Such active exploration will also be important when dealing with compositional or multi-object settings, where establishing views of all sides of an object is not straightforward due to the presence of other objects or object-parts.
When there are multiple objects in the world (including different parts of a compositional object), it is beneficial to recognize the object currently observed (i.e. converge to high confidence) before moving onto a new object.
Such a policy could have different approaches, such as moving back if an LM believes it has moved onto a new object (reactive), or using the model of the most-likely-hypothesis to try to stay on the object (pro-active), or a mixture of these. In either case, these would be instances of model-based policies.
When exploring an environment with multiple objects (including components of a compositional object), it is beneficial to quickly move to a new object when the current one has been recognized, so as to rapidly build up a model of the outside world.
It would be useful to have a policy that uses a mixture of model-free components (e.g. saliency map) and model-based components (learned relations of sub-objects to one another in a higher-level LM) to make a decision about where to move next in such an instance.
This therefore relates to both Model-based policy to recognize an object before moving on to a new object and Implement efficient saccades driven by model-free and model-based signals.
Ideally, both this policy and the policy to remain on an object could be formulated together as a form of curiosity, where a learning module aims to reduce uncertainty about the world state.
The hypothesis-testing policy is able to generate candidate points on an object that, when observed, should rapidly disambiguate between similar objects, or between similar poses of the same object.
Generating these points requires a model-based policy that simulates the overlap in the graphs between the two most likely objects (or the two most likely poses of the same object). This is a relatively expensive operation, and so one approach would be to store these points in long-term memory, reusing them in future episodes.
For example, when we first learn about the concept of a mug, we might need to deliberately think about the fact that its handle is what distinguishes it from many other cylindrical objects. However, once we have experienced recognizing mugs a few times, we could quickly recall that testing the handle is a good way to confirm whether we are sensing a mug, or some other object. Related to this, an LM can track how sensing different regions of an object affects its evidence values for the collective hypotheses - those areas that have a disproportionate effect on a top hypothesis are likely to be good candidates for testing in future episodes.
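A minimal sketch of such a long-term store is shown below. `compute_distinguishing_points` is a placeholder for the expensive graph-overlap simulation, and keying by unordered object pair is our own assumption.

```python
class TestPointMemory:
    """Cache hypothesis-testing target points across episodes (a sketch)."""

    def __init__(self):
        self._store = {}

    def get_test_points(self, object_a, object_b, compute_distinguishing_points):
        # Key by unordered pair, so (mug, can) and (can, mug) share an entry.
        key = frozenset((object_a, object_b))
        if key not in self._store:
            # Expensive model-based simulation; run once, then reuse.
            self._store[key] = compute_distinguishing_points(object_a, object_b)
        return self._store[key]
```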
Datasets do not typically capture the flexibility of object labels based on whether an object belongs to a broad class (e.g. cans), vs. a specific instance of a class (e.g. a can of tomato soup).
Labeling a dataset with "hierarchical" labels, such that an object might be both a "can", as well as a "can of tomato soup" would be one approach to capturing this flexibility. Once available, classification accuracy could be assessed both at the level of individual object instances, as well as at the level of categories.
We might leverage crowd-sourced labels to ensure that this labeling is reflective of human perception, and not biased by our beliefs as designers of Monty. This also relates to the general problem of Multi-Label Classification, and so there may be off-the-shelf solutions that we can explore.
Initially such labels should focus on morphology, as this is the current focus of Monty's recognition system. However, we would eventually want to also account for affordances, such as an object that is a chair, a vessel, etc. Being able to classify objects based on their affordances would be an experimental stepping stone to the true measure of the system's representations, which would be how well affordances are used to manipulate the world.
[!TIP] We have implemented a first dataset to test compositional modeling! You can find the dataset descriptions and current performance on it in our benchmark documentation. Nevertheless, we will eventually want to implement a more complex dataset as described below to test scene level representations and generalization.
To test compositional objects, we would like to develop a minimal dataset based on common objects (such as mugs and bowls) with logos on their surfaces. This will enable us to learn on the component objects in isolation, while moving towards a more realistic setting where the component objects must be disentangled from one another. The logo-on-surface setup also enables exploring interesting challenges of object distortion, and learning multiple location-specific associations, such as when a logo has a 90 degree bend half-way along its length.
It's worth noting that observing objects and sub-objects in isolation is often how compositional objects are learned in humans. For example, when learning to read, children begin by learning individual letters, which are themselves composed of a variety of strokes. Only when letters are learned can they learn to combine them into words. More generally, disentangling an object from other objects is difficult without the ability to interact with it, or see it in a sufficient range of contexts that its separation from other objects becomes clear.
We would eventually expect compositional objects to be learned in an unsupervised manner, for example learning that a wing on a bird is a sub-object, even though it may never have been observed in isolation. When this is consistently possible, we can consider more diverse datasets where the component objects may not be as explicit. At that time, the challenges described in Figure out Performance Measure and Supervision in Heterarchy will become more relevant.
In the future, we will move towards policies that change the state of the world. At this time, an additional dataset that may prove useful is a "dinner-table setting" with different arrangements of plates and cutlery. For example, the objects can be arranged in a normal setting, or aligned in a row (i.e. not a typical dinner-table setting). Similarly, the component objects can be those of a modern dining table, or those from a "medieval" time-period. As such, this dataset can be used to test the ability of Monty systems to recognize compositional objects based on the specific arrangement of objects, and to test generalization to novel compositions. Because of the nature of the objects, they can also be re-arranged in a variety of ways, which will enable testing policies that change the state of the world.
Example of compositional objects made up of modern cutlery and plates.
Example of compositional objects made up of medieval cutlery and plates.
title: Object Behavior Test Bed description: Set up an environment to test modeling and inferring object behavior under various conditions.
This item relates to the broader goal of modeling object behaviors in Monty.
We can test Monty's ability to model object behaviors with successively more difficult tasks:
Level 1:
- Objects that move repeatedly in the same way
Level 2:
- Different morphologies with the same behavior
- Same behavior at varying orientations on morphologies
- Objects that stop moving at some points in the sequence
Level 3:
- One object has multiple behaviors at different locations
Level 4:
- Same behavior at varying speeds
Level 5:
- Behaviors that involve a change in features instead of movement
- 1 sensor patch that gets processed by 2 SMs (one for static and one for moving features) and sent to 2 LMs.
- Supervision during learning -> provide object ID, pose & state in sequence at each step.
- Evaluate accuracy & speed of recognizing behavior & morphology ID, pose & state (+visualize).
- Could start with 2D environment to make the state dimension easier to visualize (keep in mind potential difficulty in extracting movement from a 3D depth image).
See Decompose Goals Into Subgoals & Communicate for a discussion of the kind of tasks we are considering for early object-manipulation experiments. An even simpler task that we have recently considered is pressing a switch to turn a lamp on or off. We will provide further details on what these tasks might look like soon.
Beyond the specifics of any given task, an important part of this future-work component is to identify a good simulator for such settings. For example, we would like to have a setup where objects are subject to gravity, but are prevented from falling into a void by a table or floor. Other elements of physics such as friction should also be simulated, while it should be straightforward to reset an environment, and specify the arrangement of objects (for example using 3D modelling software).
title: Future Work Widget rfc: https://github.com/thousandbrainsproject/tbp.monty/blob/main/rfcs/0015_future_work.md estimated-scope: medium improved-metric: community-engagement output-type: documentation skills: github-actions, python, github-readme-sync-tool, s3, javascript, html, css contributor: codeallthethingz status: in-progress
Build a filterable widget based on the documentation future-work section data that can be inserted into the docs using an iFrame.
title: Make More Condensed Videos About the Project & Monty rfc: required estimated-scope: medium skills: video-editing, content-creation
title: Organize & Start Podcast Series rfc: required estimated-scope: large skills: podcasting, interviewing, writing, video-editing
Movement is core to how LMs process and model the world. Currently, an LM receives an observation encoded with a body-centric location, and then infers a displacement in object-centric coordinates. Similarly, goal-states are specified as a target location in body-centric coordinates, which are then acted upon.
However, a more general formulation might be to use displacements as the core spatial information in the CMP, such that a specific location (in body-centric coordinates or otherwise) is not the primary form of communication outside of an LM or sensor module.
Such an approach might align well with adding information about flow (see Detect Local and Global Flow), modeling moving objects (see Deal With Moving Objects), and supporting abstract movements like the transition from grandchild to grandparent. It would also result in a reformulation of "goal-states" to "goal-displacements".
Note that whatever approach is taken, we would still need to have some information about shared location representations at some level of the system in order to enable coordination and voting between LMs. This may relate to the division of "what" and "where" pathways in the brain, although this is not yet clear and requires further investigation.
Currently, voting relies on all learning modules sharing the same object ID for any given object, as a form of supervised learning signal. Thanks to this, they can vote on this particular ID when communicating with one another.
However, in the setting of unsupervised learning, the object ID that is associated with any given model is unique to the parent LM. As such, we need to organically learn the mapping between the object IDs that occur together across different LMs, such that voting can function without any supervised learning signal. This is the same issue faced by the brain, where a neural encoding in one cortical column (e.g. an SDR), needs to be associated with the different SDRs found in other cortical columns.
It is also worth noting that being able to use voting within unsupervised settings will enable us to converge faster, offsetting the issue of not knowing whether we have moved to a new object or not. This relates to the fact that evidence for objects will rapidly decay in order to better support the unsupervised setting.
Initially, such voting would be explored within modality (two different vision-based LMs learning the same object), or across modalities with similar object structures (e.g. the 3D objects of vision and touch). However, this same approach should unlock important properties, such as associating models that may be structurally very different, like the vision-based object of a cow, and the auditory object of "moo" sounds. Furthermore, this should eventually enable associating learned words with grounded objects, laying the foundations for language.
Finally, this challenge relates to Use Pose for Voting, where we would like to vote on the poses of objects, since the learned poses are also going to be unique to each LM.
As we create Monty systems with more LMs, it will become increasingly important to be able to emphasize the representations in certain LMs over others, as a form of "covert" attention. This will complement the current ability to explicitly attend to a point in space through motor actions.
For example in human children, learning new language concepts significantly benefits from shared attention with adults ("Look at the -"). A combination of attending to a point in space (overt attention), alongside narrowing the scope of active representations, is likely to be important for efficient associative learning.
Implementation-wise, this will likely consist of a mixture of top-down feedback and lateral competition.
Currently we do not send out pose hypotheses when we are voting; however, we believe they will be an important signal to use. One complication is that the poses stored for any given LM's object models are arbitrary with respect to other LMs' models, as each uses an object-centric coordinate system.
This relates to Generalize Voting To Associative Connections, which faces a similar challenge.
To make this more efficient, it would also be useful to improve the way we represent symmetry in our object models (see Improve Handling of Symmetry), as this will significantly reduce the number of associative connections that need to be learned for robust generalization.
title: Vote on State description: Update the voting algorithm to take the state of an object into account.
This item relates to the broader goal of modeling object behaviors in Monty.
Since a state acts as a kind of sub-ID of the object, voting should treat it as such. If a hypothesis includes state A, it will only add evidence for state A of the object (using the existing mechanism to take location and orientation into account).
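Schematically, and assuming hypotheses are stored as flat arrays (the names below are illustrative, not the current voting API), state-conditioned voting might look like this:

```python
import numpy as np

def add_state_conditioned_votes(hyp_object_ids, hyp_states, evidence, votes):
    """Add vote evidence only to hypotheses matching both object ID and state.

    `votes` maps (object_id, state) pairs to evidence values received from
    other LMs; location and orientation handling is omitted for brevity.
    """
    for (object_id, state), value in votes.items():
        match = (hyp_object_ids == object_id) & (hyp_states == state)
        evidence[match] += value
    return evidence
```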
Currently, Monty's infrastructure only supports a single agent that moves around the scene, where that agent can be associated with a plurality of sensors and LMs. We would like to add support for multiple agents that move independently.
For example, a hand-like surface-agent might explore the surface of an object, where each one of its "fingers" can move in a semi-independent manner. At the same time, a distant-agent might observe the object, saccading across its surface independent of the surface agent. At other times they might coordinate, such that they perceive the same location on an object at the same time, which would be useful while voting connections are still being learned (see Generalize Voting to Associative Connections).
An example of a first task that could make use of this infrastructure is Implement a Simple Cross-Modal Policy for Sensory Guidance.
It's also worth noting that we would like to move towards the concept of "motor modules" in the code-base, i.e. a plurality of motor modules that convert from CMP-compliant goal states to non-CMP actuator changes. This would be a shift from the singular "motor system" that we currently have.
title: Change Detecting SM description: Add a new type of sensor module that detects local changes and outputs those as CMP messages.
This item relates to the broader goal of modeling object behaviors in Monty. It also builds on detecting local and global flow in the SM.
As outlined in the theory section on modeling object behaviors, the idea is that we can use the same LM for modeling object behaviors as we do for modeling static objects. The only difference is what kind of input the LM receives. An LM that receives input from the change detecting SM outlined here would be learning behavior models.
The change detecting SM would detect changes such as a local moving feature (like a bar moving in a certain direction) or a local changing feature (like color or illumination changing). Whenever it detects such a change it outputs this change as a CMP message. The output has the same structure as the output of other SMs. It contains the location in a common reference frame (e.g. relative to the body) at which the change was detected, the orientation of the change (e.g. the direction in which the edge moved), and optional features describing the change (e.g. edge vs. curve moving).
It is important that the SM only outputs information about local changes. If there is global change detected (like all features across the sensor are moving), this usually indicates that the sensor itself is moving, not the objects in the world. For more details see detecting local and global flow.
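To make the message structure concrete, a change detection could be packaged roughly as below. The class name and fields are illustrative, chosen to mirror the structure of existing SM outputs rather than any implemented API.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ChangeObservation:
    """Sketch of a CMP-style message from a change detecting SM."""

    location: np.ndarray     # (3,) location of the change, e.g. relative to the body
    orientation: np.ndarray  # (3,) direction of the change, e.g. of the moving edge
    features: dict = field(default_factory=dict)  # e.g. {"change_type": "moving_edge"}
```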
Potential details to keep in mind:
- May need to update static feature SM to make sure it doesn’t send output when local movement is detected
- May get noisy estimates of 3D movement from a 2D depth image
Our general view is that there are two sources of flow processed by cortical columns. These should correspond to:
- Local flow: detected in a small receptive field, and indicates that the object is moving.
- Global flow: detected in a larger receptive field, and indicates that the sensor is moving. Note however that depending on the receptive field sizes, it may not be possible for a particular learning module to always distinguish these. For example, if an object is larger than the global-flow receptive field, then from that LM's perspective, it cannot distinguish between the object moving and the sensor moving.
Note that flow can be either optical or based on sensed texture changes for a blind surface agent.
Implementing methods so that we can estimate these two sources of flow and pass them to the LM will be an important step towards modeling objects with complex behaviors, as well as accounting for noise in the motor-system's estimates of self-motion.
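One simple way to approximate this separation, assuming a per-pixel flow field is available (e.g. from optical flow), is to treat the median flow over the large receptive field as global flow and the residual as local flow. The threshold below is an arbitrary placeholder.

```python
import numpy as np

def split_flow(flow, local_threshold=0.5):
    """Split an (H, W, 2) flow field into global and local components.

    The median over the whole field approximates self-motion (global flow);
    pixels whose residual exceeds the threshold are flagged as local flow,
    i.e. candidate object motion.
    """
    global_flow = np.median(flow.reshape(-1, 2), axis=0)
    residual = flow - global_flow
    local_mask = np.linalg.norm(residual, axis=-1) > local_threshold
    return global_flow, residual, local_mask
```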
Eventually, similar techniques might be used to detect "flow" in how low-level LM representations are changing. This could correspond to movements in non-physical spaces, and enable more abstract representations in higher-level LMs. See also Can We Change the CMP to Use Displacements Instead of Locations?
Currently non-morphological features are very simple, such as extracting the RGB or hue value at the center of the sensor patch.
In the short term, we would like to extract richer features, such as using HTM's spatial-pooler or Local Binary Patterns for visual features, or processing depth information within a patch to approximate tactile texture.
In the longer-term, given the "sub-cortical" nature of this sensory processing, we might also consider neural-network based feature extraction, such as shallow convolutional neural networks, however please see our FAQ on why Monty does not currently use deep learning.
Note that regardless of the approach taken, features should be rotation invariant. For example, a textured pattern should be detected regardless of the sensor's orientation, and the representation of that texture should not be affected by the sensor's orientation.
There is significant scope for custom learning modules in Monty. In particular, learning modules can take a variety of forms, so long as their input and output channels adhere to the Cortical Messaging Protocol, and that they model objects using reference frames. However, exactly how a "reference frame" is implemented is not specified.
Currently, our main approach is to use explicit graphs in Cartesian space, with evidence values accumulated, somewhat analogous to a particle filter. An example of an alternative approach would be using grid-cell modules to model reference frames.
In the future, we will provide further guidance on how custom learning modules can be designed and implemented. If this is something you're currently interested in, please feel free to reach out to us.
This work relates to first being able to Detect Local and Global Flow.
Our current idea is to then use this information to model the state of the object, such that beyond its current pose, we also capture how it is moving as a function of time. This information can then be made available to other learning modules for voting and hierarchical processing.
This work also relates to Modeling Object Behaviors and States, as an object state might be quite simple (the object is moving in a straight line at a constant velocity), or more complex (e.g. in a "spinning" or "dancing" state). To pass such information via the Cortical Messaging Protocol, the former would likely be treated similarly to pose (i.e. specific information shared, but limited in scope), while the latter would be shared in a manner more similar to object ID, i.e. via a summary representation that can be learned via association.
title: Event Detection to Reset Timer description: Add ability to detect significant events which are used as signals to reset the global interval timer.
This item relates to the broader goal of modeling object behaviors in Monty. For a broader overview of the (semi)-global interval timer, see this future work page.
Any LM should have the ability to detect significant events (such as the attack of a new note, an object reaching a canonical state, or a significant change in input features). If one such event is detected, the LM (or LMs) send a signal to the interval timer (outside the LM) which is then reset. The interval timer provides input to a large group of LMs and hence a detected event in one LM can reset the input signal to many LMs.
We might also want SMs to have the ability to detect significant events, but this is still unclear.
In a natural, unsupervised setting, the object that a Monty system is observing will change from time to time. Currently, Monty's internal representations of objects are only explicitly reset by the experimenter, for example at the start of an episode of training.
We would like to see if we can achieve reasonable performance when objects change (including during learning) by having a shorter memory horizon that rapidly decays. Assuming policies are sufficiently efficient in their exploration of objects, this should enable us to effectively determine whether we are still on the same object, on a different (but known) object, or on an entirely new object. This can subsequently inform changes such as switching to an exploration-focused policy (see Implement Switching Between Learning and Inference-Focused Policies).
Note that we already have the past_weight and present_weight parameters, which can be used for this approach. As such, the main task is to set up experiments where objects are switched out without resetting the LM's evidence values, and then evaluate the performance of the system.
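Schematically, the role of these parameters in shortening the memory horizon can be illustrated as below. The exact update in Monty's evidence LM differs in detail, so this is only meant to show the trade-off.

```python
import numpy as np

def update_evidence(evidence, new_evidence, past_weight=0.8, present_weight=0.2):
    """Illustrative weighted evidence update.

    With past_weight < 1, previously accumulated evidence decays each step,
    so hypotheses for an object that is no longer being observed fade,
    while present_weight controls how strongly new observations count.
    """
    return past_weight * evidence + present_weight * new_evidence
```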
If this fails to achieve the results we hope for, we might add a mechanism to explicitly reset evidence values when an LM believes it has moved on to a new object. In particular, we have implemented a method to detect when we have moved on to a new object based on significant changes in the accumulated evidence values for hypotheses. Integrating this method into the LMs is still in progress, but once complete, we would like to complement it with a process to reinitialize the evidence scores within the learning module. That way, when the LM detects it is on a new object, it can cleanly estimate what this new object might be.
Eventually this could be complemented with top-down feedback from a higher-level LM modeling a scene or compositional object. In this case, the high-level LM biases the evidence values initialized in the low-level LM, based on what object should be present there according to the higher-level LM's model. Improvements here could also interact with the tasks of Re-Anchor Hypotheses, and Use Better Priors for Hypothesis Initialization.
We would like to test using local functions between nodes of an LM's graph to model object behaviors. In particular, we would like to model how an object evolves over time due to external and internal influences, by learning how nodes within the object impact one another based on these factors. This relates to graph-neural networks, and graph networks more generally, however learning should rely on sensory and motor information local to the LM. Ideally learned relations will generalize across different edges, e.g. the understanding that two nodes are connected by a rigid edge vs. a spring.
As noted, all learning should happen locally within the graph, so although gradient descent can be used, we should not back-propagate error signals through other LMs. Please see our related policy on using Numpy rather than PyTorch for contributions. For further reading, see our discussion on Modeling Object Behavior Using Graph Message Passing in the Monty Labs repository.
We have a dataset that should be useful for testing approaches to this task, which can be found in Monty Labs.
At a broader level, we are also investigating alternative methods for modeling object behaviors, including sequence-based methods similar to HTM, however we believe it is worth exploring graph network approaches as one (potentially complementary) approach. In particular, we may find that such learned edges are useful for frequently encountered node-interactions like basic physics, while sequence-based methods are best suited for idiosyncratic behaviors.
LMs currently recognize symmetry by making multiple observations in a row that are all consistent with a set of multiple poses. I.e. if new observations of an object do not eliminate any of a set of poses, then it is likely that these poses are equivalent/symmetric.
To make this more efficient and robust, we might store symmetric poses in long-term memory, updating them over time. In particular:
- Whenever symmetry is detected, the poses associated with the state could be stored for that object.
- Over time, we can reduce or expand this list of symmetric poses, enabling the LM to establish with reasonable confidence that an object is in a symmetric pose as soon as the hypothesized poses fall within the list.
By developing an established list of symmetric poses, we might also improve voting on such symmetric poses - see Use Pose for Voting.
This item relates to the broader goal of modeling object behaviors in Monty.
Once we integrate state into the models learned in LMs, we will also need to add state to our hypotheses to infer the state the object is in.
To do this, a hypothesis needs to include which of the states/intervals we are in. Hence, they are represented as (l, r, s), where l is the location on the object, r is the rotation of the object, and s is the state in the sequence. In Monty, we would add a states variable to the Hypotheses class (in addition to the existing locations and poses arrays).
Analogous to testing possible locations and possible poses, we would test possible states. Depending on sensory input, the evidence for different location-rotation-state combinations would be incremented/decremented and hypotheses would be narrowed down.
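A sketch of the extended hypothesis space is shown below; field names are illustrative and the actual Hypotheses class differs in detail.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class StatefulHypotheses:
    """Hypothesis space over (location, rotation, state) tuples (a sketch)."""

    locations: np.ndarray  # (N, 3) hypothesized locations on the object
    poses: np.ndarray      # (N, 3, 3) hypothesized rotations of the object
    states: np.ndarray     # (N,) hypothesized index into the state sequence
    evidence: np.ndarray   # (N,) accumulated evidence per hypothesis

    def narrow(self, min_evidence):
        """Drop hypotheses whose evidence has fallen below a threshold."""
        keep = self.evidence >= min_evidence
        return StatefulHypotheses(
            self.locations[keep], self.poses[keep],
            self.states[keep], self.evidence[keep],
        )
```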
We can use our existing nearest neighbor search algorithm to retrieve neighbors in 3D space to predict which feature should be sensed next in the current state, at the current location.
Given the input interval stored during learning, the model can also predict with which global clock input the next input feature should coincide. The timing within an interval/state is provided by the global clock and does not need to be inferred. It can however be used to correct the speed of the timer to recognize sequences at different speeds. For more details see the page on speed detection to adjust timer.
Note that adding an additional dimension to the hypothesis space will add a multiplicative factor to its size and computational cost. Hence using fewer points to represent models will likely be important.
title: Include State in Models description: Add a 'state' dimension to the models learned in Monty that conditions which features to expect at what locations.
This item relates to the broader goal of modeling object behaviors in Monty.
Any model in an LM can be state conditioned. For instance a stapler might be open or closed, and an object behavior is represented as a sequence of states. Depending on the state of an object, the model expects to see different features at different locations.
In Monty, state can be interpreted as a sub-object ID. Depending on the state of an object Monty will look at the model stored for this state (also see include state in hypotheses).
One alternative is to represent state as a 4th dimension in the models such that every location on the object is represented as (x, y, z, s). However, it seems like interpolation along the state/time dimension is likely limited. Furthermore, measuring distance along the state dimension is likely quite different from along the three spatial dimensions. Neither of these points would be reflected well if using a continuous 4D model.
States are learned as an ordered sequence. The model of an object can include an ordered sequence of states as well as the temporal duration between states in the sequence. The interval length between two states is provided by the timer input to the LM and can be stored in the models. States might also be traversed by applying actions, however we don't have a concrete proposal for this yet.
Both behavior & morphology models can have different states and sequences and both can be driven by time or other factors. In other words, there is no difference between the LMs besides their input.
One aspect that we believe may contribute to dealing with object distortions, such as perceiving Dali's melted clocks for the first time, or being robust to the way a logo follows the surface of a mug, is through re-anchoring of hypotheses. More concretely, as the system moves over the object and path-integrates, the estimate of where the sensor is in space might lend greater weight to sensory landmarks, resulting in a re-assessment of the current location. Such re-anchoring is required even without distortions, due to the fact that path integration in the real world is imperfect.
Such an approach would likely be further supported by hierarchical, top-down connections (see also Add Top-Down Connections). This will be relevant where the system has previously learned how a low-level object is associated with a high-level object at multiple locations, and where the low-level object is in some way distorted. In this instance, the system can re-instate where it is on the low-level object, based on where it is on the high-level object. Depending on the degree of distortion of the object, we would expect more such location-location associations to be learned in order to capture the relationship between the two. For example, a logo on a flat surface with a single 90-degree bend in it might just need two location associations to be learned and represented, while a heavily distorted logo would require more.
It's worth emphasizing that this approach would also help reduce the reliance on the first observation. In particular, the first observation initializes the hypothesis space, so if that observation is noisy or doesn't resemble any of the points in the model, it has an overly-large impact on performance.
title: Speed Detection to Adjust Timer description: Add ability to detect offsets in timer input and learned sequence model to speed up or slow down the global interval timer.
This item relates to the broader goal of modeling object behaviors in Monty. For more details on time representations and processing, see our future work page on the interval timer.
The LM has expectations of when it will sense the next feature at the current location. This is stored as the interval duration that came in from the global interval timer during learning at the same time as the sensory input.
If the next expected feature at the current location appears earlier than what is stored in the model (i.e. timer input < stored interval), the LM sends a signal to the global timer to speed up (by the magnitude of the difference).
If the next expected feature at the current location appears later than what is stored in the model (i.e. timer input > stored interval), the LM sends a signal to the global timer to slow down.
Note: This might be a noisy process and require voting to work well.
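The adjustment signal itself could be as simple as the signed difference, scaled by some gain. This is a sketch; the gain and sign convention are our own assumptions.

```python
def timer_speed_signal(timer_input, stored_interval, gain=1.0):
    """Speed-adjustment signal an LM sends to the global interval timer.

    Positive when the expected feature arrived early (timer_input <
    stored_interval), meaning the timer should speed up by the magnitude
    of the difference; negative when it arrived late, meaning slow down.
    """
    return gain * (stored_interval - timer_input)
```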
It remains unclear how scale invariance would be implemented at a neural level, although we have discussed the possibility that the frequency of oscillatory activity in neurons is scaled. This could in turn modulate how movements are accounted for during path integration.
Regardless of the precise implementation, it is reasonable to assume that a given learning module will have a range of scales that it is able to represent, adjusting path integration in the reference frame according to the hypothesized scale. This scale invariance would likely have the following properties:
- Heuristics based on low-level sensory input (e.g. inferred distance) that are used to rapidly propose the most probable scales.
- Testing of different scales in parallel, similar to how we test different poses of an object.
- Storing the most commonly experienced scales in long-term memory, using these to preferentially bias initialized hypotheses, related to Use Better Priors for Hypothesis Initialization.
These scales would represent a small sub-sampling of all possible scales, similar to how we test a subset of possible rotations, and consistent with the fact that human estimates of scale and rotation are imperfect and tend to align with common values.
For example, if an enormous coffee mug was on display in an art installation, the inferred distance from perceived depth, together with the size of eye movements, could suggest that - whatever the object - features are separated on the scale of meters. This low-level information would inform testing objects on a large scale, enabling recognition of the object (albeit potentially with a small delay). If a mug was seen at a more typical scale, then it would likely be recognized faster, similar to how humans recognize objects in their more typical orientations more quickly.
Thus, infrastructure for testing multiple scales (i.e. adjusted path integration in reference frames), or bottom-up heuristics to estimate scale, would be useful additions to the learning module.
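The core adjustment is small: under a pose-and-scale hypothesis, a sensed displacement is rotated into the model's reference frame and divided by the hypothesized scale before path integration. The sketch below illustrates this, with a hand-picked set of candidate scales standing in for those retrieved from long-term memory.

```python
import numpy as np

def displacement_in_model_frame(displacement, rotation, scale):
    """Path-integrate a body-frame displacement under a scale hypothesis.

    Under a "twice normal size" hypothesis (scale=2), a 2 cm physical
    movement advances only 1 cm through the stored model.
    """
    return (rotation @ displacement) / scale

# Candidate scales tested in parallel, analogous to candidate rotations.
scales_to_test = np.array([0.5, 1.0, 2.0])
```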
In addition to the above scale invariance within a single LM, we believe that different LMs in the hierarchy will have a preference for different scales, proportional to the receptive field sizes of their direct sensory input. This would serve a complementary purpose to the above scale invariance, constraining the space of hypotheses that each LM needs to test. For example, low-level LMs might be particularly adept at reading lettering/text. More generally, one can think of low-level LMs as being well suited to modeling small, detailed objects, while high-level LMs are better at modeling larger objects at a coarser level of granularity. Once again, this will result in objects that are of typical scales being recognized more quickly.
In order to make better use of the available computational resources, we might begin by sampling a "coarse" subset of possible hypotheses across objects at the start of an episode. As the episode progresses, we could re-sample regions that have high probability, in order to explore hypotheses there in finer detail. This would serve the purpose of enabling us to have broad hypotheses initially, without unacceptably large computational costs. At the same time, we could still develop a refined hypothesis of the location and pose of the object, given the additional sampling of high-probability regions.
Furthermore, when the evidence values for a point in an LM's graph fall below a certain threshold, we generally stop testing it. Related to this, the initial feature pose detected when the object was first sensed determines the pose hypotheses that are initialized. We could therefore implement a method to randomly initialize a subset of rejected hypotheses, and then test these. This relates to Less Dependency on First Observation.
This work could also tie in with the ability to Use Better Priors for Hypothesis Initialization, as these common poses could be resampled more frequently.
Currently all object poses are equally likely, because stimuli exist in a void and are typically rotated randomly at test time. However, as we move towards compositional and scene-like datasets where certain object poses are more common, we would like to account for this information in our hypothesis testing.
A simple way to do this is to store in long-term memory the frequently encountered object poses, and bias these with more evidence during initialization. A consequence of this is that objects should be recognized more quickly when they are in a typical pose, consistent with human behavior (see e.g. Lawson et al, 2003).
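A sketch of such a bias at initialization is shown below, using the geodesic angle between rotation matrices to decide whether a candidate pose is "near" a stored common pose. All magnitudes and tolerances are placeholders.

```python
import numpy as np

def pose_prior_bonus(candidate_rotations, common_rotations,
                     bonus=1.0, tolerance_rad=0.3):
    """Extra initial evidence for hypotheses near frequently seen poses.

    `candidate_rotations` is (N, 3, 3); `common_rotations` is (M, 3, 3),
    drawn from long-term memory. Candidates within `tolerance_rad` of any
    stored pose receive a bonus.
    """
    bonuses = np.zeros(len(candidate_rotations))
    for i, candidate in enumerate(candidate_rotations):
        for common in common_rotations:
            # Geodesic angle between the two rotations.
            cos_angle = (np.trace(candidate.T @ common) - 1.0) / 2.0
            angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))
            if angle < tolerance_rad:
                bonuses[i] = bonus
                break
    return bonuses
```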
In terms of implementation, this could be done either relative to a body-centric coordinate, through a hierarchical biasing, or both. With the former, the object would have an inherent bias towards a pose relative to the observer, or some more abstract reference-frame like gravity (e.g. right-side up coffee mug). With the latter, the pose would be biased with respect to a higher-level, compositional object. For example, in a dinner table setup, the orientation of the fork and knife would be biased relative to the plate, even though in and of themselves, the fork and knife do not have any inherent bias in their pose. This information would be stored in the compositional dinner-set object in the higher level LM, and the bias in pose implemented by top-down feedback to the low-level LM. Such feedback could bias both specific poses, as well as specific locations of the child object relative to the parent object, or specific scales of an object (see also Support Scale Invariance).
This task relates to the on-going implementation of hierarchically arranged LMs. As these become available, it should become possible to decompose objects into simpler sub-object components, which in turn will enable LMs to model objects with significantly fewer points than the ~2,000 per object currently used.
There are a variety of instances where a Monty system has a hypothesis about the current object, and then moves off the hypothesis-space of that object, either sensing nothing/empty space, or another object. For example, this can occur due to a model-free driven action like a saccade moving off the object, or the surface agent leaving the surface of the object. Similarly, a model-based action like the hypothesis-testing "jump" can move an agent to a location where the object doesn't exist if the hypothesis it tested was false.
Currently we have methods to move the sensor back onto the object, however we do not make use of the information that the object was absent at the perceived location. However, this is valuable information, as the absence of the object at a location will be consistent with some object and pose hypotheses, but not others.
For example, if the most-likely hypothesis is a coffee mug and the system performs a saccade that results in the nearest feature being very far away (such as the distant surface of a table), then any hypotheses about the pose of the mug that predicted there would be mug-parts should receive evidence against them. On the other hand, a hypothesis about the mug's pose that is consistent with the absence of the mug at that location should receive positive evidence. The same would apply if the surface agent leaves the surface of the mug, where the absence of mug is consistent with a subset of hypotheses.
A more nuanced instance arises if there is something at the expected location, but it is a feature of a different object. For example, when moving to where the handle of the coffee mug is believed to be, we might sense a glass of water. Again, however, as the sensed features (transparent, smooth glass) are different from those predicted by hypotheses that there was a mug handle there, then these hypotheses should receive negative evidence. On the other hand, this observation of unusual features is entirely consistent with the hypotheses that predicted that the mug would not be there. As such, it should still be possible to adjust evidence values based on how observations match the predictions of each hypothesis.
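Both cases above reduce to the same evidence rule: compare what each hypothesis predicted at the sensed location with whether the object was actually found there. A minimal sketch, with placeholder magnitudes, follows.

```python
import numpy as np

def absence_evidence(predicted_on_object, sensed_on_object,
                     reward=0.5, penalty=1.0):
    """Evidence delta from finding, or not finding, the object at a location.

    `predicted_on_object` is a boolean array over hypotheses: did each one
    predict object surface at the sensed location? Hypotheses consistent
    with the actual observation gain evidence; inconsistent ones lose it.
    """
    if sensed_on_object:
        return np.where(predicted_on_object, reward, -penalty)
    return np.where(predicted_on_object, -penalty, reward)
```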
Here we publish relatively recent parts of our theory that have not previously been published. We are still in the process of creating easier-to-follow resources for these ideas and have not implemented them in Monty yet.
title: Neuroscience Theory - Overview description: Intro page to the underlying theory of this project.
[!WARNING] This section is still a work in progress.
title: Object Behaviors description: Our theory on how object behaviors are represented, learned, and inferred.
The world is both static and dynamic. Some features of the world have a fixed arrangement relative to each other and don’t change over time. For example, the arrangement of edges and surfaces of a coffee cup or the arrangement of rooms in a house do not change from moment to moment. Sometimes, however, the arrangement of features in the world changes over time and in predictable ways. For example, a stapler does not have a fixed morphology; it can open and close and emit staples, a traffic light can change color, a human can move its limbs to run, and a lever may rotate, causing a door to open. We refer to these changes as object behaviors.
Learning the behaviors of objects is an essential component of any intelligent system. If an intelligent system is going to generate novel behaviors, it first needs to learn the behaviors of objects and then enact those behaviors in both familiar and novel situations.
In the TBT, static objects are learned as a set of features at poses (location and orientation) in a reference frame. There is an assumption that the features are not changing or moving, therefore, the existing theory and implementation work well for representing static structure. Here, we describe how to extend the TBT to learn the dynamic structure of the world.
To learn and represent object behaviors, we use the same mechanism as we use for learning static structure. The main difference is what the input features represent. In the static models, the features are also static, such as the edge of a mug. In the dynamic models, the features represent changes, such as the moving edge of a stapler top as it opens. Static features are stored at locations in static/morphology models, and changing features are similarly stored at locations in behavior models. Additionally, behaviors occur over time. As a stapler opens, the locations where moving edges occur change over time. Therefore, behavior models are sequences of changes at locations.
Static models and dynamic models are learned in separate reference frames, but they share sensor patches and how the sensor patch is moving. Therefore, when we observe a stapler, we can simultaneously learn both the morphology of the stapler and how that morphology changes over time. But because the behavioral model has its own reference frame, it exists independently of the stapler. Now imagine we see a new object that doesn’t look like a stapler. If this new object starts to open like a stapler, then the stapler’s behavior model will be recognized and we will predict the new object behaves like a stapler. This method is very general and applies to every type of behavior we observe in the world.
The above figure shows our current theory for how morphology and behavioral models are implemented in the neocortex. Two cortical columns are shown. Each cortical column uses the same modeling mechanism of associating whatever input it receives to L4 with location representations in L6a. The location representations are updated using sensor movement input. The columns are agnostic to what type of input goes to L4. The column on the left receives static features and will learn morphology models. The column on the right receives changing features and will learn behavioral models.
Consider vision. Two main inputs from the eyes to the cortex are the parvocellular and magnocellular pathways. The cells in both pathways have center-surround receptive fields. If the associated patch of retina is uniformly lit, these cells will not fire. If the retinal patch is not uniformly lit, they will fire. The parvocellular cells respond to static, non-moving input patterns. The magnocellular cells respond to moving, non-static patterns. If parvocellular cells are input to L4, the column will learn the static morphology of an object. If magnocellular cells are input to L4, the column will learn the dynamic behavior of an object.
We considered whether the two models could coexist in a single column. Maintaining two reference frames in one column proved complex, and we ultimately decided it was more likely that behavior and morphology models would exist in separate columns. One possibility is that morphology models exist in ventral/what visual regions and behavioral models exist in the dorsal/where visual regions.
Models of dynamic structure consist of sequences of behavioral elements over time, and the timing between elements is often important. Many cells have apical dendrites that extend to L1. We propose that the apical dendrites in L1 learn the timing between events via a projection from matrix cells in the thalamus. (This is the same mechanism we previously proposed as part of the HTM sequence memory.) While static models don’t require learning temporal sequences, they can still use state to predict the current morphology.
The figure below illustrates how state and time relate to the two types of models. The behavioral model on the left stores where changes occur over time. It does not represent the morphology of the object exhibiting the behavior. It represents behaviors independent of any particular object. The morphology model on the right stores where static features occur for a particular object. It may have learned several morphology models for the object, such as an open and closed stapler, but on its own, it doesn’t know how the stapler moves. The concept of state applies to both models. On the left, state represents where the model is in a sequence. On the right, state represents different learned morphologies for an object.
As we have previously demonstrated [2, 7, 10], cortical columns can vote to quickly reach a consensus of the static object being observed. The same mechanism applies to behaviors. Multiple columns can vote to quickly infer behaviors.
Analogous to object models, behavior models can be recognized in any location, orientation, and scale by transforming the physical movement vector into the behavior's reference frame [11]. This allows for recognizing a behavior at different locations on an object in varying orientations and scales, and therefore represents a flexible way to apply and recognize behaviors in novel situations. Notably, the behavior can be recognized independently of the object on which it was learned.
Any particular object may exhibit multiple independent behaviors. For example, the top of a stapler can be raised or lowered and, independently, the stapler deflection plate can be rotated. A coffee maker may have a power switch, a lid to add water, and a filter basket that swings out. Each of these parts exhibits its own behaviors.
In general, any morphological model can exhibit multiple behaviors, and any behavioral model can apply to different morphological models. This is analogous to compositional structure. The theory and mechanism we proposed for how the cortex learns compositional structure were derived for compositions of physical objects, such as a logo on a mug.
We propose that the same mechanism applies to behavioral models. Specifically, as illustrated below, a behavioral object can be learned as a feature of a morphological model in a higher region. This way, a behavior can be associated with different locations on an object. This association also encodes the behavior's orientation and scale relative to the object.
Note that the parent object, on the right in the figure, does not know if the child object is a morphology or behavioral object.
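One way to picture this association in code is a parent model that stores child model IDs at locations, each with a pose relative to the parent. The structure below is an illustrative assumption, not Monty's actual data structures; note that nothing in an entry reveals whether the child is a morphology or a behavior model:

```python
from dataclasses import dataclass


@dataclass
class ChildFeature:
    child_id: str    # e.g. "lid_opening" (a behavior) or "logo" (a morphology)
    location: tuple  # where on the parent the child is anchored
    rotation: tuple  # child orientation relative to the parent (e.g. Euler angles)
    scale: float     # child scale relative to the parent


# Hypothetical parent model: a stapler whose lid-opening behavior is
# anchored near the hinge; the entry looks the same for any child type.
stapler_children = [
    ChildFeature("lid_opening", location=(0.0, 0.02, 0.05),
                 rotation=(0.0, 0.0, 0.0), scale=1.0),
]
```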
The above examples illustrate that the two modeling systems, morphology/static and behavior/dynamic, are similar and share many mechanisms and attributes. This commonality makes it easier to understand and implement the two systems.
While the two modeling systems are similar in many ways, a major difference is that behavioral models require a temporal dimension. Our morphology models have so far done without one, although they too may use time and represent different states of the same object. Behaviors are high-order sequences, and the time between sequence elements is often important.
Previous work at Numenta showed how any layer of neurons can learn high-order sequences [1]. This mechanism will work well for learning behavioral sequences. In addition, the time between sequence elements must also be learned.
Matrix cells in the thalamus could encode the time passed since the last event and broadcast this signal widely across the neocortex. Matrix cells project to L1, where they form synapses on the apical dendrites of L3 and L5 cells, allowing the behavioral model to encode the timing between behavioral states. The static model does not require time but could still learn state- or context-conditioned models using the same mechanism. For instance, the stapler's morphology differs depending on whether it is in the open or closed state.
During inference, the model can be used to speed up or slow down the global time signal. If input features arrive earlier than expected, the interval timer is sped up. If they arrive later than expected, the interval timer is slowed down. This allows for recognizing the same behavior at different speeds.
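A rough sketch of this adaptation rule follows; the class, the multiplicative update, and the gain parameter are assumptions for illustration rather than a specification:

```python
class IntervalTimer:
    """Tracks model time since the last event and adapts its rate so that
    expected intervals line up with observed ones."""

    def __init__(self, rate=1.0, gain=0.1):
        self.elapsed = 0.0  # model time since the last event
        self.rate = rate    # multiplier applied to wall-clock time
        self.gain = gain    # how aggressively to adapt the rate

    def tick(self, dt):
        self.elapsed += self.rate * dt

    def on_event(self, expected_interval):
        # Event arrived early -> elapsed < expected -> speed the timer up;
        # arrived late -> elapsed > expected -> slow it down.
        error = expected_interval - self.elapsed
        self.rate *= 1.0 + self.gain * error / expected_interval
        self.elapsed = 0.0
```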
[!NOTE] For a more detailed description of the time mechanism, see our future work page on the interval timer.
We have implemented a software-based system for learning and recognizing static object models based on the Thousand Brains Theory. This implementation is in an open-source project called Monty [9]. The static object models learned by Monty are represented as features (vectors) at locations and orientations in Euclidean space. We then use sensed features and movements (displacements) to recognize an object in any location and orientation in the world. This is done by iteratively updating hypotheses about sensed objects and their poses. The pose hypothesis is used to rotate the sensed displacement into the model’s reference frame. The features stored in the model at the location in the object’s reference frame are then compared to the sensed features. For a detailed explanation of the algorithm, see our publications [7, 8, 12] and documentation [10].
For behavior models, we propose using the same mechanism. The main difference is that the LM modeling behaviors only receives input when changes are detected. This could be a local movement (e.g., a moving edge) or a feature change (e.g., color changing). It therefore encodes changes at locations. We use the same mechanism of iteratively testing hypotheses when inferring a behavior. Analogous to the object recognition mechanism, we apply a pose hypothesis to the sensed sensor displacement to transform it into the behavior model’s reference frame. We can then compare the sensed change to the stored change at that location.
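The sketch below illustrates this shared hypothesis-testing step. Only what the model stores at a location differs between the two cases: static features for morphology models, expected changes for behavior models. The structure and the `stored_at` lookup are assumptions for illustration, not Monty's implementation:

```python
import numpy as np
from scipy.spatial.transform import Rotation  # pose hypotheses as rotations


def update_hypotheses(hypotheses, displacement, observation, models, tol=0.1):
    """One inference step: move each hypothesis by the sensed displacement
    (rotated into the model's frame) and keep it only if the observation
    matches what the model stores near the new location. For a morphology
    model the observation is a static feature; for a behavior model it is
    a sensed change."""
    surviving = []
    for obj_id, location, rotation in hypotheses:
        new_location = location + rotation.inv().apply(displacement)
        stored = models[obj_id].stored_at(new_location)  # hypothetical lookup
        if stored is not None and np.linalg.norm(stored - observation) < tol:
            surviving.append((obj_id, new_location, rotation))
    return surviving  # inference converges when one candidate remains
```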
For a concrete implementation in tbp.monty, we need to add a capability to sensor modules to detect changes. These could be, for example, local optic flow (indicating movement of the object) or features appearing or disappearing. Analogous to static features, these changes would be communicated to learning modules as part of the CMP messages. The static mechanisms are what we have implemented to date. We would use the same mechanisms for learning and inference in the behavior model, except that the LM receives its input from an SM that detects changes.
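A change-detecting SM could look roughly like the sketch below; the class name, threshold, and message format are assumptions, not part of tbp.monty today:

```python
import numpy as np


class ChangeDetectionSM:
    """Emits output only when the features observed at a sensor patch
    change by more than a threshold between consecutive steps."""

    def __init__(self, threshold=0.05):
        self.threshold = threshold
        self.previous_features = None

    def step(self, features, location):
        """Return a change message for the LM, or None if nothing changed."""
        message = None
        if self.previous_features is not None:
            delta = features - self.previous_features
            if np.linalg.norm(delta) > self.threshold:
                message = {"change": delta, "location": location}
        self.previous_features = features
        return message
```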
Additionally, behavior models require a temporal input, and the model's state must be conditioned on it. This could be implemented as multiple graphs, or sets of points in the same reference frame, that are traversed as time passes. There are many possible ways to achieve this in code; the important thing is that the state in the temporal sequence conditions the changes to expect at a location.
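Sketched below is one such encoding, under the assumption that the temporal states form a simple cycle: each state stores the changes expected at each location, together with the learned interval to the next state.

```python
from dataclasses import dataclass, field


@dataclass
class BehaviorState:
    expected_changes: dict   # location (tuple) -> expected change vector
    interval_to_next: float  # learned time until the next state


@dataclass
class BehaviorModel:
    states: list = field(default_factory=list)
    current: int = 0

    def predict_change(self, location):
        # The current temporal state conditions which change to expect here.
        return self.states[self.current].expected_changes.get(location)

    def advance(self):
        # Traverse the sets of points in the same reference frame over time.
        self.current = (self.current + 1) % len(self.states)
```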
The inferred state of a model is then communicated in the CMP output of the learning module, both for voting and as input to the higher-level LM.
In practice, recognizing a behavior will likely depend more strongly on voting between multiple learning modules, at least for more complex behaviors, where a single sensor patch may not be able to observe enough of the behavioral state on its own before it changes. The voting process for behavior models would work analogously to the one already implemented for object models.
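As a toy illustration of such voting, consensus by intersecting the candidate sets of all LMs is one simple choice; the fallback to a union when the LMs fully disagree is likewise an assumption:

```python
def vote(candidate_sets):
    """Intersect the behavior candidates of all LMs; if the LMs disagree
    entirely, fall back to the union rather than an empty consensus."""
    consensus = set.intersection(*candidate_sets)
    return consensus or set.union(*candidate_sets)


lm_votes = [
    {"lid_opening", "plate_rotating"},
    {"lid_opening"},
    {"lid_opening", "stapling"},
]
print(vote(lm_votes))  # -> {'lid_opening'}
```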
In summary, the capabilities we aim to add to Monty are:
1. Learn models of object behaviors.
2. Recognize object behaviors (+ pose & state) independent of morphology.
3. Learn associations between behavior & morphology models (using hierarchy) to speed up recognition and to learn compositional models.
4. Compensate for object movement to make accurate predictions in the morphology model (using the behavior model and/or model-free signals).
5. Use behavior models to inform actions (+ learn how actions influence state).
We can start with capabilities 1 and 2 for the first prototype. Capability 3 should largely fall out of that solution, combined with our general work on modeling compositional objects. Capabilities 4 and 5 still have many unresolved questions and can be added in a second phase.
The following items focus on concrete steps to get capabilities 1-3 into Monty.
- New type of SM for change detection.
- Add 'state' to Monty's model representation.
- Add a time interval representation to Monty.
- Object behavior testbed:
  - Set up a basic test environment to learn and infer behaviors under various conditions.
  - Set up a basic Monty configuration with one change-detecting SM and one static-feature SM, and connect it to two LMs. Evaluate Monty's performance on the object behavior testbed.
See this video for a walkthrough of the above capabilities and action items.
For in-depth descriptions of the invention presented here, see the series of meeting recordings in which we conceived of the idea, formalized the general mechanism, and discussed its implementation in the brain and in Monty.
You can find the whole playlist here:
https://www.youtube.com/playlist?list=PLXpTU6oIscrn_v8pVxwJKnfKPpKSMEUvU
We are adding more videos to this playlist as we continue to explore the remaining open questions.
[1] Hawkins, J., & Ahmad, S. (2016). Why neurons have thousands of synapses: A theory of sequence memory in neocortex. Frontiers in Neural Circuits, 10, Article 23. https://doi.org/10.3389/fncir.2016.00023
[2] Hawkins, J., Ahmad, S., & Cui, Y. (2017). A theory of how columns in the neocortex enable learning the structure of the world. Frontiers in Neural Circuits, 11, Article 81. https://doi.org/10.3389/fncir.2017.00081
[3] Hawkins, J., Lewis, M., Klukas, M., Purdy, S., & Ahmad, S. (2019). A framework for intelligence and cortical function based on grid cells in the neocortex. Frontiers in Neural Circuits, 12, Article 121. https://doi.org/10.3389/fncir.2018.00121
[4] Hawkins, J. (2021). A Thousand Brains: A New Theory of Intelligence. Basic Books. ISBN 9781541675810. https://books.google.de/books?id=FQ-pzQEACAAJ
[5] Hawkins, J. C., Ahmad, S., Cui, Y., & Lewis, M. A. (2021). U.S. Patent No. 10,977,566. Inferencing and learning based on sensorimotor input data. Washington, DC: U.S. Patent and Trademark Office.
[6] Hawkins, J. C., & Ahmad, S. (2024). U.S. Patent No. 12,094,192. Inferencing and learning based on sensorimotor input data. Washington, DC: U.S. Patent and Trademark Office.
[7] Clay, V., Leadholm, N., & Hawkins, J. (2024). The Thousand Brains Project: A new paradigm for sensorimotor intelligence. https://arxiv.org/abs/2412.18354
[8] Hawkins, J. C., Ahmad, S., Clay, V., & Leadholm, N. (2025). Architecture and Operation of Intelligent System. U.S. Patent Application No. 18/751,199.
[9] Monty code: https://github.com/thousandbrainsproject/tbp.monty
[10] TBP documentation: https://thousandbrainsproject.readme.io/docs/welcome-to-the-thousand-brains-project-documentation
[11] Hawkins, J., Leadholm, N., & Clay, V. (2025). Hierarchy or heterarchy? A theory of long-range connections for the sensorimotor brain. https://arxiv.org/pdf/2507.05888
[12] Leadholm, N., Clay, V., Knudstrup, S., Lee, H., & Hawkins, J. (2025). Thousand-brains systems: Sensorimotor intelligence for rapid, robust learning and inference. https://arxiv.org/abs/2507.04494