On Tuesday, Sundar Pichai opened the Google I/O keynote:
“A year ago on the I/O stage we first shared our plans for Gemini: a frontier model built to be natively multimodal from the beginning, that could reason across text, images, video, code, and more. It marks a big step in turning any input into any output — an ‘I/O’ for a new generation.”
Multimodal AI has been an obsession of mine for years. When we launched Playbox, the premise was clear: all content mediums should be interchangeable. The vision of converting a video to an article or vice versa was tantalisingly close, yet elusive in 2020. Fast forward to today, and it appears we were more prescient than overly optimistic.
OpenAI’s event on Monday (notably announced at very short notice, and scheduled less than 24 hours before Google’s) showcased a new, faster model: GPT-4o. It appears to be the first truly multimodal model: it takes audio, video, text, or image inputs directly, without an intermediary translation layer. Previous models required separate steps to handle different types of data, but GPT-4o integrates these capabilities natively.
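For a sense of what “natively multimodal” means in practice, here is a rough sketch of a mixed text-and-image request through OpenAI’s chat completions API. The image URL is a placeholder, and note that the audio and video capabilities were demoed in the ChatGPT app rather than exposed through the public API at launch:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Text and an image go into the *same* message; there is no separate
# transcription or captioning step before the model sees the input.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this photo."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```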
The standout feature was the voice interaction. One of the challenges we faced with Playbox was the inadequacy of text-to-speech (TTS) technology at the time. Our custom models could imitate a person’s voice, but the result was often slightly robotic. The initial version of ChatGPT’s voice mode showed promise, but I wanted it to be more human. The latest model, however, is almost disconcertingly human. Gone are the long, uninterruptible monologues after a brief wait. Now we have instantaneous responses, shorter answers, and the ability to cut off and redirect the AI simply by speaking. It is also, curiously, quite flirtatious. The voice interaction feels natural, with subtle meandering and emotional nuance, eerily similar to Scarlett Johansson’s voice in the movie ‘her’.¹
This model, together with Google’s evolution of Gemini, means that AI can convert most digital inputs into human-readable digital outputs with remarkable consistency, albeit still with some errors. Whether it’s text, an image, a video, or sound, the AI can interpret and transform it into the desired format. That output, however, is by its nature creative and accommodates a degree of inaccuracy—success metrics for most outputs here are relatively subjective.² This “creativity” is, of course, the result of probabilistic computing and is baked into the very nature of how these models work. Most of real life, however, is deterministic: a given set of inputs should produce a given output when an operation is applied.
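To make that probabilistic/deterministic distinction concrete, here is a minimal Python sketch (not any vendor’s actual implementation) of temperature-based token sampling, the mechanism that makes LLM output vary from run to run. The tokens and scores are invented for illustration:

```python
import math
import random

def softmax(logits, temperature):
    # Scale scores by temperature, then normalise into probabilities.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_token(tokens, logits, temperature):
    if temperature == 0:
        # Greedy decoding: always the highest-scoring token -> deterministic.
        return tokens[max(range(len(logits)), key=logits.__getitem__)]
    # Sampling: the same inputs can yield different outputs on each call.
    probs = softmax(logits, temperature)
    return random.choices(tokens, weights=probs, k=1)[0]

tokens = ["cat", "dog", "ferret"]
logits = [2.0, 1.5, 0.1]  # illustrative scores, not from a real model

print([pick_token(tokens, logits, temperature=0) for _ in range(5)])    # identical every time
print([pick_token(tokens, logits, temperature=1.0) for _ in range(5)])  # varies between runs
```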
AI and 3D
Hidden within the GPT-4o launch article was an intriguing development: the model’s ability to produce 3D renderings.
*A spinning 3D OpenAI logo*
The documentation mentions a 3D reconstruction based on six keyframes but lacks detail on the creation process, which leaves two possibilities:
- GPT-4o might include video capabilities to produce a consistent 3D-like video by interpolating between 2D frames.
- Alternatively, GPT-4o might generate actual 3D files from images.
While generating interpolated 2D frames is impressive, creating true 3D models would be revolutionary, not least because accurate 3D models require a level of determinism that we have not seen previously within large language models.
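To make the second possibility concrete: an “actual 3D file” in a format like Wavefront OBJ is just plain text listing exact vertex coordinates and face indices, so generating one means committing to precise geometry rather than plausible-looking pixels. Here is a minimal illustrative sketch that writes a tetrahedron to an OBJ file (OpenAI has not documented GPT-4o’s output format; this example is purely my own):

```python
# A tetrahedron: four vertices, four triangular faces.
vertices = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.5, 1.0, 0.0),
    (0.5, 0.5, 1.0),
]
faces = [  # OBJ face indices are 1-based
    (1, 2, 3),
    (1, 2, 4),
    (2, 3, 4),
    (1, 3, 4),
]

with open("tetrahedron.obj", "w") as f:
    for x, y, z in vertices:
        f.write(f"v {x} {y} {z}\n")  # 'v' lines define vertex positions
    for a, b, c in faces:
        f.write(f"f {a} {b} {c}\n")  # 'f' lines reference vertices by index
```

Every coordinate in that file is an exact commitment; a model that gets one wrong produces a mesh that is visibly broken or unprintable, which is why true 3D generation demands the determinism discussed above.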
Bringing AI Outputs into the Real World
If a model can take keyframes and build a 3D model from them, the implications for AI-generated outputs are immense. This advancement transitions us from digital constraints to generating tangible objects in the real world.
Over the past decade, 3D printing has become increasingly mainstream. From hobbyists’ spare bedrooms to Formula 1 labs, 3D printers enable rapid iteration and production of physical objects. Although industry and early adopters have embraced the technology, it has not yet become commonplace. This is primarily due to two factors: a) 3D modeling is complex and has a steep learning curve, and b) 3D printers are still not as user-friendly as the mass consumer market demands.
An AI model capable of accurately producing or reproducing physical objects from limited information could not only lower the barrier to creating 3D models for printing but eliminate it entirely. By combining this capability with more detailed input data—such as photos enhanced with the LIDAR depth scans available on high-end phones, or stereoscopic views from multiple cameras—almost anyone could create 3D objects simply by taking a photo and sending it to the AI for reconstruction.
The major hurdle to overcome is the development of a 3D printer that is effortless to use. However, if creating 3D objects becomes easy for consumers, it is likely that printer manufacturers will rise to meet the demand, similar to how the rise of digital photography led to a surge in high-quality home photo printers.
As with any AI conversation, significant legal and ethical implications arise. If consumers can recreate objects simply by taking a photo and sending it to a printer, concerns about infringement will extend beyond artists, as seen with image generators like Midjourney and DALL-E. Companies will also worry, recognising that non-complex products could be easily replicated at home. Furthermore, questions will arise about the accountability of companies behind AI models that produce 3D objects, especially if an AI-designed device causes injury.
Whether GPT-4o can create true 3D objects or not, it feels like we are on the brink of a revolution. The convergence of multiple complementary technologies heralds an era where creating almost any output is becoming increasingly accessible. While we are not quite in the realm of Star Trek’s replicator, we are getting closer. The potential for AI to transform how we create and interact with both digital and physical worlds is enormous, and the implications—both exciting and challenging—will shape our future in profound ways.
Footnotes
1. If you haven’t seen ‘her’, it’s definitely worth a watch given how closely reality appears to be hewing to the film. ↩
2. Yes, AI is very good at generating code, which one could argue either objectively works or doesn’t. In practice, however, any significant code block generated by an AI needs a fair amount of human input to make sure it is actually fit for purpose… ask me how I know. ↩