The AI explosion is warping our sense of time. Can you believe Stable Diffusion is only 4 months old, and ChatGPT <4 weeks old 🤯? If you blink, you miss a whole new industry. Here are my TOP 10 AI spotlights, from a breathtaking 2022 in rewind ⏮: a long thread 🧵
2022’s AI landscape is dominated by a surge in huge generative models, which are rapidly making their way out of research labs and into real-world applications. 2 other emerging areas driven by LLM technology are decision-making agents (games, robotics, …) and AI4Science.
I am very fortunate to work on these AI research frontiers with my wonderful colleagues @NVIDIAAI. In 2023, I’ll keep doing exciting work myself and sharing hot ideas too - you’re welcome to follow me @DrJimFan! 🙌

For each of the 10 spotlights, there may be multiple works:
🎉No. 1: Text -> Image

DALLE-2 was the first large-scale diffusion model that can generate realistic, high-res images from an arbitrary caption. It kickstarted the AI4art revolution that spawned many new applications, startups, and ways of thinking.
But DALLE-2 is behind OpenAI’s walled garden. @StabilityAI, LMU, and @runwayml took the heroic step to train their own internet-scale text2image model, based on the “latent diffusion” algorithm. They called the model “Stable Diffusion”, and open-sourced the code & weights.
The open access to Stable Diffusion has proven to be a huge game changer. Numerous startups and research labs build upon SD to create novel apps, and SD itself gets improved continuously by the open-source community. SD has recently hit v2.1 and runs on a single GPU now!
There were 2 other text2image models from @GoogleAI this year. Neither came with a released model or an API to play with, but the papers still had interesting insights.

1. Imagen:
2. Parti: A transformer model without diffusion!

🎉No. 2: Text -> Text

Well, that’s an easy guess - ChatGPT! The only app in history that gained 1 million users in 5 days.

ChatGPT prompted out our human creativity as well. I refer you to this list for all the useful & imaginative ChatGPT ideas:
ChatGPT & GPT-3.5 both use a new technology called RLHF (“Reinforcement Learning from Human Feedback”). Its profound implication is that prompt engineering will disappear very soon. See my deep dive 🧵 on this topic:


ChatGPT’s popularity spawned a wave of new startups and competitors, notably Jasper Chat, YouChat, @Replit Ghostwriter chat, and @perplexity_ai. Some of them offer much more intuitive ways to do search, so Google execs are getting sweaty! @goodside has a nice thread:


🎉No. 3: Text -> Robot🤖

How do we give GPT arms and legs, so it can clean up your messy kitchen? Unlike NLP models, robot models need to interact with the physical world. Big pre-trained transformers are finally starting to address the hardest problems in robotics this year!

In October, my coauthors and I took a step towards building a “Robot GPT” that takes in any *mixture* of text, images, and videos as prompt, and outputs robot motor control! Our model is called VIMA (“VisuoMotor Attention”) and has been fully open-sourced 🧵:


Along a similar path as VIMA, researchers from @GoogleAI announced RT-1, a robot transformer trained on 700 tasks and 130K human demonstrations. These data were collected by a literal *Iron Fleet* - 13 robots over 17 months!

I also wrote a deep dive 🧵 for RT-1:


🎉No. 4: Text -> Video

Video is just an array of images bundled together over time, creating the illusion of motion. If we can do text2image, then why not throw in the time axis for some extra fun?
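
To make the "array of images over time" idea concrete, here is a minimal sketch in plain Python. It is a toy illustration only, not how any of the text2video models below work: the helper names (`make_frame`, `make_video`, `video_shape`) are made up for this example, and the "video" is just a white dot sliding across a tiny grayscale frame.

```python
from typing import List, Tuple

Frame = List[List[int]]  # one grayscale image: height x width pixel intensities
Video = List[Frame]      # a video: frames stacked along a leading time axis


def make_frame(height: int, width: int, dot_column: int) -> Frame:
    """Return a black frame with a single white dot in the middle row."""
    frame = [[0] * width for _ in range(height)]
    frame[height // 2][dot_column % width] = 255
    return frame


def make_video(num_frames: int, height: int, width: int) -> Video:
    """Stack frames over time; the dot shifts one column per frame."""
    return [make_frame(height, width, t) for t in range(num_frames)]


def video_shape(video: Video) -> Tuple[int, int, int]:
    """Return (time, height, width) for a non-empty video."""
    return (len(video), len(video[0]), len(video[0][0]))


video = make_video(num_frames=4, height=3, width=5)
print(video_shape(video))  # (4, 3, 5)
```

Played back in order, the frames show the dot moving left to right. A text2image model produces one such frame; a text2video model has to produce the whole stack and keep it consistent along the time axis.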
There are 3 (!!) big works in this area, but none of them open-source 😢

Make-A-Video: Text-to-Video Generation without Text-Video Data. From @MetaAI.

You can sign up for trial access here:


Imagen Video: High Definition Video Generation with Diffusion Models. From @GoogleAI - a natural follow-up on the Imagen static image generator.


Phenaki: Variable Length Video Generation From Open Domain Textual Description. Also from @GoogleAI

🎉No. 5: Text -> 3D

From designing innovative products to creating stunning visual effects in movies & games, 3D modeling will be the next step in the creative AI field to materialize ideas from text. 2022 has seen a few primitive but promising 3D generative models!


DreamFusion: Text-to-3D using 2D Diffusion. From @GoogleAI. Based on the NeRF algorithm, the 3D model generated from a given text can be viewed from any angle, relit by arbitrary illumination, or composited into any 3D environment.

Project site:

2 works from my colleagues @NVIDIAAI:

👉 GET3D: A Generative Model of High Quality 3D Textured Shapes Learned from Images.
👉 Magic3D: High-Resolution Text-to-3D Content Creation.


Point-E: A System for Generating 3D Point Clouds from Complex Prompts. By @OpenAI, a preliminary version of 3D DALLE!



🎉No. 6: AI can now play Minecraft!

The game is a perfect testbed for general intelligence because (1) it is infinitely open-ended & creative; (2) it’s played by 140M people - twice the UK’s population, and a treasure trove of human data!

Can AI be as imaginative as we are?

My coauthors and I developed the first Minecraft AI that can solve many tasks given a *natural language prompt*. Our ultimate goal is to build an “Embodied ChatGPT”. We’ve fully open-sourced our development platform “MineDojo”. Deep dive 🧵 on our NeurIPS Outstanding Paper:


Concurrently, @jeffclune’s team also announced a model called VPT (“Video Pre-Training”) that directly outputs keyboard and mouse actions. It can solve longer-horizon tasks, but is not language-conditioned. MineDojo & VPT complement each other!

🎉No. 7: AI learns to negotiate!

CICERO from @MetaAI is the first AI agent to achieve human-level performance in Diplomacy, a strategy game that requires extensive natural language negotiation to cooperate & compete with humans. AI can now persuade and bluff effectively 😮!


Concurrently, @DeepMind also announced their Diplomacy-playing agent. What would happen if CICERO played DeepMind’s AI? 🤔

🎉No. 8: Audio -> Text

OpenAI Whisper is a large Transformer that approaches human-level robustness & accuracy on English speech recognition. It’s trained on 680,000 hours of audio data from the web! Will Whisper unlock more text tokens to feed GPT-4?

🎉No. 9: Nuclear Fusion control☢️

@DeepMind & @EPFL developed the first deep reinforcement learning system that can keep nuclear fusion plasma stable inside a tokamak (a device that uses a powerful magnetic field to confine plasma in a torus).

Also this month, the U.S. Department of Energy announced a huge breakthrough: a nuclear fusion reaction that generates more energy than it consumes to initiate the reaction! It is the first time humankind has achieved this landmark. We may become a fusion-powered civilization in this lifetime!

🎉No. 10: Biology Transformers🧬

AlphaFold (2021) was the first model to predict the 3D structure of a protein accurately. In July, DeepMind announced the “protein universe” - expanding AlphaFold’s protein database to 200M structures! What a treasure chest for science!

@NVIDIAAI also expanded the BioNeMo LLM framework to help biotech companies and researchers generate, predict, and understand biomolecular data.

This concludes our whirlwind tour of the top 10 AI highlights of 2022! There are countless other exciting works that contribute to these advancements, too many to fit in a 🧵. I believe every paper is a brick in the cathedral of AI, and all the efforts should be celebrated🥳.
Parting thought: as AI systems are growing ever more powerful, it is crucial that we remain aware of the potential dangers & risks and take steps to mitigate them, whether through careful training design, proper deployment oversight, or novel safeguard approaches.
Here's to a year filled with so many moments that took our breath away! 🥂🍻🍾
Happy holidays everyone. Follow me for more deep dives in 2023 🙌

