Apple Intelligence: Context Matters

Today, Apple announced the arrival of Apple Intelligence (AI) – the long-awaited integration of Artificial Intelligence (AI) into their main product lineup of iOS devices. Accompanied by the faint whisper of 100 more AI app startups taking their last breath, the announcement was a declaration of intent; the next version of iOS will be deeply infused with AI. In my previous exploration, I posited that the utility of an AI assistant depends on its ability to have personal context and deep knowledge about my life and preferences. Without that understanding, AI assistants are of relatively little use.

The most promising element of Apple Intelligence for me, therefore, is the depth of integration and access it has been promised. Apple’s approach is unique – in effect, two AIs work side by side: a private, local-only AI with access to the data on your phone, and a secondary (OpenAI-powered) AI acting as a higher-powered interaction layer. Together, they create a proposition that could actually achieve what I had hoped for in Apple and AI.

As a result of the promised deep integration, Apple Intelligence will be able to access a level of personal context that apps just can’t match. And that ability to access personal context without requiring direct user input really matters. To understand why it matters so much, we need to consider how AIs obtain, retain, and process information.

Traditional Computing vs AI

In computing, there have traditionally been two main ways that information is stored: on a hard drive and in RAM (Random Access Memory). In the mental model that we have of our own memory, a hard drive stores long-term memories—knowledge that you know, a visual scene of your 12th birthday party, your date of birth—and RAM stores your working or short-term memory—what you need to get from the shops that evening, or what you went to the kitchen to do.

AI models work quite differently from a traditional computer. Yes, the models are stored on a hard drive, and most of the work is done “in memory,” but the model doesn’t specifically draw on files to access information. Instead, models effectively have two separate elements: the statically trained model itself and the ‘context window’.

Long-Term Memory: Training Data

When we ‘train’ a model, we provide it with what is described as a ‘pile’ of information. This pile, or large dataset, is what the model learns patterns and information from. Effective and extensive training gives the model behind ChatGPT (other models are available) the ability to accurately answer when World War II occurred, for example, based not on a file that tells it the correct dates but from the patterns it learned from within the training data.

Whilst it is being trained, the model learns the structure of language from the corpus, which is what gives it the ability to respond correctly in different languages. There are other processes that enhance the model’s performance, but this provides a basic understanding of how information is embedded into a model.

The way a model retrieves information and the way it constructs sentences for its answers are intertwined processes. It generates an answer that is probabilistically correct for the given query, based on the patterns and knowledge it has learned from the training data.

This approach works quite well. We now have models that have been trained on extremely large datasets that appear to know a lot of information and have a great command of many languages—these are the models at the forefront of generalised AI. The training process for a model is relatively slow, though, and (at the moment) that training can’t be kept continuously updated, so even the most recently released models have a relatively out-of-date knowledge cut-off date. The information stored in the model is also impersonal in nature – it doesn’t specifically know about you and your life, and when you do directly provide it with information, you are not adding to the model itself. Instead, you are working within what we consider to be the AI’s short-term memory.

Short-Term Memory: Context Windows

This short-term memory is the ‘context window’. The term refers to the amount of new information that an LLM can understand at once. Whilst it appears that we are in a back-and-forth conversation with a model when we’re using these tools, in fact we start afresh with every new message we send to ChatGPT. What is happening behind the scenes is that the entire conversation to that point is sent to the model with every single request, to provide it with the historical context of the conversation. The model itself remains unchanged by your inputs; it is just provided with the greater context afforded by having the previous chat.

Context windows are called windows for a reason, though; they have bounds, and beyond a certain point, the historic conversation will be cut off to fit within the window. For GPT-4, that point is at 128,000 tokens, or about 96,000 words. For Gemini 1.5 Pro, it’s as much as 2 million tokens or around 1.5 million words. This sounds like a lot, but in the context of your phone’s data, it barely scratches the surface.
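
To make that concrete, here is a minimal sketch of what happens on every turn of a chat, assuming a crude word-count stand-in for tokens (real systems use a proper tokeniser such as tiktoken); the names and numbers are illustrative rather than anyone’s actual implementation:

```python
# A minimal sketch of how a chat client re-sends history each turn and
# trims it to fit a context window. Token counting here is a crude
# word-count proxy; real systems use a proper tokeniser (e.g. tiktoken).

CONTEXT_LIMIT_TOKENS = 128_000  # roughly GPT-4's window


def estimate_tokens(message: dict) -> int:
    # Very rough: ~4/3 tokens per word on average for English text.
    return int(len(message["content"].split()) * 4 / 3) + 4


def build_prompt(history: list[dict], new_message: str) -> list[dict]:
    """Return the full message list sent to the model for this turn."""
    messages = history + [{"role": "user", "content": new_message}]
    # Drop the oldest turns until everything fits inside the window.
    while sum(estimate_tokens(m) for m in messages) > CONTEXT_LIMIT_TOKENS:
        messages.pop(0)
    return messages


# Each turn, the *entire* trimmed conversation goes to the model;
# the model itself is never changed by what you send it.
history: list[dict] = []
outgoing = build_prompt(history, "What did I ask you about earlier?")
```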

There is another approach, though: providing the AI with the ability to pull in data from other authoritative sources that are relevant in the moment – we call this Retrieval-Augmented Generation (RAG). By pulling in data from other sources based on its relevance to a given query, the limitations of both a lack of trained knowledge and a small context window can be overcome. And it is this approach that I had hoped Apple would take when I first discussed this earlier this year.
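
As a rough illustration of the retrieval step, here is a toy sketch in which simple keyword overlap stands in for the embedding search a real RAG system would use; the personal documents and helper functions are invented for the example:

```python
# A toy sketch of Retrieval-Augmented Generation: pull the most relevant
# personal documents for a query and prepend them to the prompt, so the
# model sees fresh context without retraining and without stuffing your
# whole life into the context window. Real systems use embeddings and a
# vector store; keyword overlap stands in for that here.

PERSONAL_DOCS = [  # invented examples
    "Calendar: Dinner with Sam at Luca, Thursday 19:30.",
    "Email: Flight BA117 to New York departs 10 May 08:25.",
    "Note: Passport renewal due in June.",
]


def relevance(query: str, doc: str) -> int:
    query_words = set(query.lower().split())
    return len(query_words & set(doc.lower().split()))


def retrieve(query: str, k: int = 2) -> list[str]:
    return sorted(PERSONAL_DOCS, key=lambda d: relevance(query, d), reverse=True)[:k]


def build_rag_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Answer using this personal context:\n{context}\n\nQuestion: {query}"


print(build_rag_prompt("What time is my flight to New York?"))
```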

The Apple Difference

I’ve previously expressed scepticism regarding AI assistants, partly because current versions of LLMs lack the personal contextual awareness I would want from an assistant. Even where an assistant is able to persist ‘important’ information, there are so many pieces of personal context that an AI without access to your entire life simply can’t have. That context is stored in messages, calendars, emails or even just the places that we’ve been to or currently are. To achieve the level of context required to be truly useful without needing to be constantly updated by the user requires something that few devices can actually have: a persistent and all-seeing presence in your life. It turns out that the best way to have that persistent presence is to be your most persistently present personal device: your phone.

What makes AI exciting is its ability to integrate personal context into the model without needing to directly provide the information. By drawing from its access to your phone’s personal data, the AI will be able to understand more about your life than any other assistant would be able to. This is as a direct result of the positioning of the AI within the stack – it sits at the OS level and can therefore do things that no standalone app could ever do.

Apple has also spent no small amount of time and effort focussing on not only branding itself as a privacy-conscious platform, but also ensuring that the choices that it makes within its platforms are genuinely driven by privacy. They have built a level of user trust that is unmatched. That Apple is able to run their AI on-device matters – personal information isn’t shared to the cloud and stays safely, and securely, on-device. It really is only Apple that can provide this due to their unique approach to hardware and software.

If iOS 18 delivers on today’s promise, I’m excited. This is the first time that I’ve felt any company has truly understood how to approach AI in a way that is genuinely helpful and understands what an AI assistant needs to be more than just a toy. If you thought I was bullish on Apple prior to this event…

AI Inputs and Outputs in the Real World

On Tuesday, Sundar Pichai opened the Google I/O keynote:

“A year ago on the I/O stage we first shared our plans for Gemini: a frontier model built to be natively multimodal from the beginning, that could reason across text, images, video, code, and more. It marks a big step in turning any input into any output — an “I/O” for a new generation.”

Multimodal AI has been an obsession of mine for years. When we launched Playbox, the premise was clear: all content mediums should be interchangeable. The vision of converting a video to an article or vice versa was tantalisingly close, yet elusive in 2020. Fast forward to today, and it appears we were more prescient than overly optimistic.

OpenAI’s event (notably announced with very short notice to occur less than 24 hours before Google’s) on Monday showcased a new, faster model: GPT-4o. This appears to be the first truly multimodal model; it takes inputs of audio, video, text, or images directly, without an intermediary translation layer. Previous models required separate steps to handle different types of data, but GPT-4o integrates these capabilities natively.
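
To give a sense of what ‘directly’ means in practice, this is roughly the shape of a multimodal request through OpenAI’s chat API at the time of writing – text and an image travel in the same message, with no separate captioning step. Treat the exact parameters as illustrative:

```python
# Rough shape of a multimodal request to GPT-4o: text and an image sent
# together in one message, with no separate vision model in between.
# Model name and message format reflect OpenAI's chat API at the time of
# writing; treat details as illustrative rather than definitive.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is happening in this photo?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```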

The standout feature was the voice interaction. One of the challenges we faced with Playbox was the inadequacy of text-to-speech (TTS) technology at the time. Our custom models could imitate a person’s voice, but the result was often slightly robotic. The initial version of ChatGPT’s voice mode showed promise, but I wanted it to be more human. The latest model, however, is almost disconcertingly human. Gone are the long, uninterruptible monologues after a brief wait. Now we have instantaneous responses, shorter answers, and the ability to cut off and redirect the AI simply by speaking. It is also, curiously, quite flirtatious. The voice interaction feels natural, with subtle meandering and emotional nuance, eerily similar to Scarlett Johansson’s voice in the movie ‘her’.1

This model, and Google’s evolution of the Gemini model, mean that AI can convert most digital inputs into human-readable digital outputs with remarkable consistency, albeit still with some errors. Whether it’s text, an image, a video, or sound, the AI can interpret and transform it into the desired format. That output, however, is by its nature creative and accommodates a degree of inaccuracy—success metrics for most outputs here are often relatively subjective.2 This “creativity”, of course, is the result of probabilistic computing and is baked into the very nature of how AI works. Most of real life, however, is deterministic; a given set of inputs should result in a given output when an operation is applied.

AI and 3D

Hidden within the GPT-4o launch article was an intriguing development: the model’s ability to produce 3D renderings.

A spinning 3D OpenAI Logo

The documentation mentions a 3D reconstruction based on six keyframes but lacks detail on the creation process. It suggests two possibilities:

  • GPT-4o might include video capabilities to produce a consistent 3D-like video by interpolating between 2D frames.
  • Alternatively, GPT-4o might generate actual 3D files from images.

While generating interpolated 2D frames is impressive, creating true 3D models would be revolutionary, not least because accurate 3D models require a level of determinism that we have not seen previously within large language models.

Bringing AI Outputs into the Real World

If a model can take keyframes and build a 3D model from them, the implications for AI-generated outputs are immense. This advancement transitions us from digital constraints to generating tangible objects in the real world.

Over the past decade, 3D printing has become increasingly mainstream. From hobbyists’ spare bedrooms to Formula 1 labs, 3D printers enable rapid iteration and production of physical objects. Although industry and early adopters have embraced the technology, it has not yet become commonplace. This is primarily due to two factors: a) 3D modeling is complex and has a steep learning curve, and b) 3D printers are still not as user-friendly as the mass consumer market demands.

An AI model capable of accurately producing or reproducing physical objects from limited information could not only lower the barrier to creating 3D models for printing but eliminate it entirely. By combining this capability with more detailed input data—such as photos enhanced with mm-accurate LIDAR scans from high-end phones or stereoscopic views from multiple cameras—almost anyone could create 3D objects simply by taking a photo and sending it to the AI for reconstruction.

The major hurdle to overcome is the development of a 3D printer that is effortless to use. However, if creating 3D objects becomes easy for consumers, it is likely that printer manufacturers will rise to meet the demand, similar to how the rise of digital photography led to a surge in high-quality home photo printers.

As with any AI conversation, significant legal and ethical implications arise. If consumers can recreate objects simply by taking a photo and sending it to a printer, concerns about infringement will extend beyond artists, as seen with image generators like Midjourney and DALL-E. Companies will also worry, recognising that non-complex products could be easily replicated at home. Furthermore, questions will arise about the accountability of companies behind AI models that produce 3D objects, especially if an AI-designed device causes injury.

Whether GPT-4o can create true 3D objects or not, it feels like we are on the brink of a revolution. The convergence of multiple complementary technologies heralds an era where creating almost any output is becoming increasingly accessible. While we are not quite in the realm of Star Trek’s replicator, we are getting closer. The potential for AI to transform how we create and interact with both digital and physical worlds is enormous, and the implications—both exciting and challenging—will shape our future in profound ways.

Footnotes

  1. If you haven’t seen ‘her’, it’s definitely worth a watch given how closely reality appears to be hewing to the film.

  2. Yes, AI is very good at generating code, which one could argue either objectively works or doesn’t, but most often any significant code block that is generated by an AI needs a fairly high amount of human input to make sure it is actually fit for purpose… ask me how I know.

Will AI kill sites like Booking.com? AI as a Personal Aggregator

I talked a little bit about my thoughts on AI Personal Assistants previously. I find the idea very exciting! My general sense is that they are more complex to implement than people first imagine, but will be brilliant for relatively simple tasks that don’t involve a high degree of personal preference. What I didn’t explore in that article is the effect of their successful widespread adoption.

AI as a Personal Assistant

The interaction model suggested for these personal assistants is fairly straightforward: you say “Get me a taxi there” or “Bring me a pizza here” and any AI Assistant worth its salt will be able to translate your request into a series of necessary actions and negotiate with whichever third-party services are required to achieve the desired outcome.

If we ask an AI assistant to book a taxi, the AI itself will then interact with another service, whether via an API or (as the rabbit r1 does) directly interacting with websites and apps on your behalf, parsing the result and then feeding that back to you in the form of your conversation.
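
As a rough sketch of that translation step – with invented service names and placeholder functions rather than anything rabbit actually does – the flow looks something like this:

```python
# A simplified sketch of the request-to-action step: parse the user's
# intent, fill in the details, and call a connected service. Everything
# here (service names, functions) is an invented stand-in, not how the
# rabbit r1's Large Action Model actually works.
from dataclasses import dataclass


@dataclass
class Action:
    service: str      # e.g. "uber"
    operation: str    # e.g. "book_ride"
    params: dict


def interpret(utterance: str, context: dict) -> Action:
    # In reality this step is the LLM's job; keyword matching stands in here.
    if "taxi" in utterance.lower():
        return Action("uber", "book_ride",
                      {"pickup": context["location"], "dropoff": context["next_meeting"]})
    raise ValueError("No matching action for request")


def execute(action: Action) -> str:
    # Placeholder for the API call / app automation the assistant would perform.
    return f"{action.service}: {action.operation} requested with {action.params}"


context = {"location": "Home", "next_meeting": "10 Main St"}
print(execute(interpret("Get me a taxi there", context)))
```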

Large Action Model
The rabbit r1 has a “Large Action Model” (LAM). The LAM translates your request into an action that it can perform on any given app, with the ability to learn how to perform actions on apps that it hasn’t already been trained on.

For the rabbit r1, that means connecting your existing services to the platform. If you want your AI to be able to book a taxi then you will need to connect your Uber account first. The same with DoorDash for a pizza or Spotify for music.

That appears to be a perfectly sensible approach. We hand over control of our other accounts so that we can let our assistants use them on our behalf, as we might a human assistant. Now, instead of tapping on a screen, we’ll just speak or type and get exactly what we want from the connected apps.

Under the above model, we’re applying an existing and accepted paradigm to the new technology. In this case, we’re replicating what a human would do: open an app and get a result. The perceived benefit is in our own interaction – or lack thereof – with the end service.

From Stratechery:

When it comes to designing products, a pattern you see repeatedly is copying what came before, poorly, and only later creating something native to the medium.

It feels like this is what is happening here. Connecting the AI to an app is just a copy of what came before. So, what is the native approach that is only possible as a result of the new technology?

Well, the huge benefit of an AI is not that it can simply replicate human functionality as a human assistant would, it’s that it can blend that with the more “traditional” functionality of computers that humans are not necessarily good at.

One of those things is doing multiple things, at the same time, very quickly.

Aggregators and the Status Quo

It won’t surprise you to learn that very few services on the internet are truly benevolent gifts to society. Wikipedia perhaps. Archive.org, maybe.

Most websites are balancing their own motivation to make money with the quality of the service that they provide to users. Google is great, but the days of 10 blue links are long gone – the results for any mildly competitive search term are headed with at least a screen of ads. Price comparison sites like Booking.com appear to exist to make customers’ lives easier but their revenue comes from the supply side, typically in the form of referral fees or ads.

Ben Thompson’s Aggregation Theory defines how these businesses work and the framework that they operate under. The internet is full of examples of companies that have, by securing both sides of a market, built incredibly successful businesses. One of his primary examples of an aggregator is Uber.

The initial technical idea behind Uber (“a taxi at the tap of a button, with live tracking on a map and easy payment”) was a good one! At launch, it was also relatively difficult to replicate. But not impossibly hard. Uber knew this; technology was not a moat on which it could rely. To build a real moat, they needed to invest in aggregating both sides of the market:

Supply (Driver Network): Uber spent a lot of money rapidly building an extensive driver network to serve its app, ensuring there were always drivers available. They did this initially by paying very well, with high fees on journeys, bonuses and referral rewards.

Demand (Passengers): Uber aggressively invested in marketing and branding to build its customer base, especially in building referral networks (at one point I had hundreds of pounds of credit on the app, just from these referrals). Uber’s prices when they launched were also ridiculously low. When Uber launched in London, a journey in an Uber Black Mercedes S-Class (the only Uber that was available) was cheaper than a Black Cab.

Uber spent billions of dollars executing all of these strategies and, through its aggregation of the supply of taxis and passengers in a given locale, “won” many major cities. But that win came at a huge cost, subsidised by investors. Eventually, it had to normalise costs; driver pay went down and passenger costs went up. The moat that they’d built was wide and deep, but competitor services popped up and built their own networks by offering incentives similar to Uber’s.

The key element for Uber though is that it still has the most users, who return to the app because it still has the most available taxi drivers, who in turn return to the app because it has a consistent supply of customers.

The actual service that Uber offers though is not highly differentiated. In any given major metropolitan area, there are at least 4 or 5 significant apps that provide the same service – routing a Toyota Prius to the customer to take them where they want to go – at around the same price point. Ultimately their product is a commodity, with little differentiation from other competitive apps. They, to some extent, rely on customers’ (a) habitual use of their app and (b) lack of motivation to check every service at once for the best price/arrival time.

What happens though if we introduce an intermediary into the equation which is motivated to check every service for the best deal? Say, an AI which can do multiple things, at the same time, very quickly?

AI as a Personal Aggregator

This is the big paradigm shift: introducing the Personal Aggregator. Instead of a human user using one platform to look for a taxi, why would AI not look at all of the platforms to find the best match?

An illustration of an AI querying multiple apps at once

Of course, this doesn’t apply just to taxis. A sufficiently powerful AI will be able to rely not just on the results of booking.com, Amazon (not technically an aggregator) or DoorDash, but the entire internet.
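
A toy sketch of what that might look like in practice: query every provider at once and rank the quotes against the user’s own preferences rather than any single platform’s. The providers, prices and weights here are all invented for illustration:

```python
# A sketch of the personal-aggregator idea: ask every provider at once and
# rank the quotes by the user's own preferences rather than any platform's.
# The providers and their quote functions are invented for illustration.
from concurrent.futures import ThreadPoolExecutor


def quote_uber(trip):    return {"provider": "Uber",    "price": 23.0, "eta_min": 4, "comfort": 3}
def quote_bolt(trip):    return {"provider": "Bolt",    "price": 19.5, "eta_min": 7, "comfort": 2}
def quote_freenow(trip): return {"provider": "FreeNow", "price": 27.0, "eta_min": 3, "comfort": 5}

PROVIDERS = [quote_uber, quote_bolt, quote_freenow]

# One user's learned preferences; another user's weights would differ.
WEIGHTS = {"price": -1.0, "eta_min": -0.5, "comfort": 2.0}


def score(option: dict) -> float:
    return sum(WEIGHTS[k] * option[k] for k in WEIGHTS)


def best_option(trip: dict) -> dict:
    with ThreadPoolExecutor() as pool:  # query everyone at once
        quotes = list(pool.map(lambda q: q(trip), PROVIDERS))
    return max(quotes, key=score)


print(best_option({"from": "Home", "to": "Airport"}))
```

Swap the weights and the same aggregator serves a different person entirely – which is exactly the point.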

The fundamental shift here is in the motivation of the AI itself. An AI Assistant is likely part of the user’s personal infrastructure – not that of a third party motivated by the desire to make profit. Of course, there is an assumption here, but I think it’s a fair one. The early generalised intelligence that we’re seeing produced by the likes of OpenAI isn’t focussed on doing just one task well; it does everything, from history homework and language translation to coding and, eventually, price comparison across the internet.

With a motivation of always returning to the user the best possible deal for a given stay, product or other service – without concern for who the seller is – and having looked at the entire market, an AI assistant stands to be the ultimate aggregator.

As part of our personal infrastructure, the motivation of the AI will be to go wherever it needs to in order to get the right deal, using whatever blend of requirements the user has provided along with the historical preferences it has learnt about them. For our taxi example, one person’s personal aggregator may aim for the cheapest possible car at the cost of comfort and expediency. The same AI acting as a personal aggregator for someone else might prioritise comfort over cost, and so on.

The fundamentals of how we interact with a huge array of businesses could completely shift, not just because using the AI is easier, but also because it is the best way to secure the best deal. The widespread adoption of this type of interaction with an AI as personal aggregator could have massive consequences for all sorts of businesses, and not just the existing aggregators.

If we go beyond the creation of an increasingly competitive (but arguably more efficient) market, the intermediating effect of the AI will disrupt one of the core tenets of Aggregator Theory: that of the direct relationship with the end user. When that direct relationship is removed in favour of the ease of use and better user experience of the AI, it is not just the transaction that is being intermediated, nor the buying decision, it is also the pre-buying decision. What is the point in marketing a commodity product from Store X to buyers if they’re going to buy it via their AI?

It may feel that we’re a while off this reality, but as Bill Gates said “We always overestimate the change that will occur in the next two years and underestimate the change that will occur in the next ten”.

The effects on the markets that these commodity products exist within will be drastic, but there will be an increased incentive for differentiation to occur.

Embracing the New Paradigm

The advent of AI as a personal aggregator heralds a transformative era in how we interact with and perceive the digital marketplace. This shift isn’t just about convenience or efficiency; it’s about a fundamental change in the power dynamics between service providers, aggregators, and consumers. AI won’t necessarily be great at telling us what to buy, but it will be at telling us where to.

For decades, aggregators have thrived by simplifying choice and monopolising user attention. They’ve dictated market terms, often at the expense of both service providers and consumers. However, as AI steps in as the ultimate intermediary, it will potentially democratise access to the market. By evaluating all available options to identify the best deal, AI will disrupt the traditional aggregator model, shifting the focus back to the quality and value of the service itself.

This change will compel service providers to innovate and differentiate. Yes, there may initially be a race to the bottom on price, but differentiation will then need to come from enhancing quality and user experience.

Moreover, the rise of AI assistants as personal aggregators raises important questions about the future of marketing and customer relationships. As AI becomes the gatekeeper of consumer choices, traditional marketing strategies might lose their efficacy. Businesses will need to adapt by finding new ways to engage with both AI systems and their users, focusing more on the intrinsic value of their offerings rather than just visibility and brand recognition.

The emergence of AI as a personal aggregator is not just a technological advancement; it is potentially a catalyst for a more equitable and consumer-centric marketplace. It will challenge existing business models and force a rethink of market strategies across various sectors. We’re not there quite yet, but we are standing on the brink of a paradigm shift, and it is essential for businesses to understand and adapt to these changing dynamics. The future promises an era where choice is not just about availability, but about relevance and quality, in which consumers are hopefully empowered by AI like never before.

Generative Video: Targeting You Soon

On Thursday OpenAI released Sora, their text-to-video model. It is remarkable, and I plan to write more about it and the future of video more generally in a forthcoming article.

Here’s an example of what it can do…

Prompt:

A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

Output:

This video was generated completely by the AI model. No 3D rendering or other inputs. It is, simply put, lightyears ahead of what we’ve seen from the likes of Google’s Lumiere and Runway’s Gen-2 which were the state of the art.

The most impressive details aren’t on the main Sora page though, they’re hidden away in the technical report. Not only can Sora generate entirely fresh video as shown above, but it can also transform existing videos significantly, based on a text prompt:

Input:

Prompt:

change the setting to the 1920s with an old school car. make sure to keep the red color

Output:

There are obviously huge implications to this for video entertainment, but one of the more fascinating aspects of this for me is what it means for ad creative.

Yes, we might not like video ads, but they are sadly here to stay. One of the reasons we don’t like them (beyond the interruption) is that they often feel so un-targeted and repetitive. That’s not entirely surprising given that the cost of making a 30-second commercial runs into the tens of thousands of dollars, let alone the time it takes to produce.

From the preview that we saw on Thursday though, these videos can be made in less than 10 minutes and at significantly reduced cost. Whilst the prospect of seeing hyper-targeted videos akin to last decade’s oddly specific t-shirt ads is not appealing, from a brand perspective being able to quickly and cheaply produce not only a series of ads targeted at each and every demographic but also alternatives to A/B test with is a marketer’s dream.

Film one video and then have AI produce modified versions for different ages, locales and cultural norms – a little dystopian? Probably. Inevitable? Absolutely.

Sure, we’re (maybe) unlikely to see this footage used in the Super Bowl, but as YouTube ads? Absolutely. And really, I’m not entirely averse to receiving YouTube ads that fit with my interests as well as Instagram ads tend to.

It’s not just preplanned creative that will benefit here either – social media teams seeking to produce responsive ads that match the calibre of Oreo’s famous 2013 tweet in response to a power cut at the Super Bowl now have a tool that can produce high-production quality video in minutes.

Apple and AI

Media speculation has been swirling around the idea that Apple is lagging behind in the AI race.

The headline advances in AI over the past couple of years have been generative technologies, and if you’re comparing chatbots then, yes, Siri looks like ancient technology. Apple’s approach to AI though has been to infuse it throughout iOS, with the goal of incrementally improving the user experience throughout without highlighting to users just how much of the heavy lifting the AI is actually doing.

This is consistent with the Steve Jobs playbook: “You’ve got to start with customer experience and work backwards to the technology. You can’t start with the technology.” Apple aren’t focused on making AI the centrepiece of their product launches and in fact have previously gone out of their way to avoid using the term, instead preferring the use of “Machine Learning” to describe the technologies that underpin the experience, but if you look around iOS it is omnipresent.

Screenshots of AI in use on an iPhone

Features that rely on Artificial Intelligence have been integrated throughout iOS: Cameras produce a quality of image far beyond what is achievable solely with the tiny optics fitted to the phone. Messaging has translation baked in. And swiping up in Photos gives options to look up the Artwork (or Landmark, Plant or Pet) shown in the image.

The approach that Apple takes is quite different from that of other manufacturers – these improvements aren’t flagship features, but relatively subtle improvements to the overall product experience that work towards that overarching goal of improving the customer experience.

Generative Hype

On Thursday, during an earnings call, Tim Cook addressed the question of AI directly, stating:

As we look ahead, we will continue to invest in these and other technologies that will shape the future. That includes artificial intelligence where we continue to spend a tremendous amount of time and effort, and we’re excited to share the details of our ongoing work in that space later this year.

Let me just say that I think there’s a huge opportunity for Apple with Gen AI and AI, without getting into more details and getting out in front of myself.

This is the most direct indication that the company is looking to bring more generative AI features to iOS. What form that will take is still speculation, but we can assume that one area that will be addressed is Siri.

What is Siri?

To most, Siri is Apple’s slightly confused voice assistant.

Whilst this part of Siri desperately needs improvement in the context of the other services that users have become familiar with, I think that it’s unlikely that we’ll see Apple release a chatbot with the freedom that GPT (or Bing) had. I just can’t see a world where Apple allows Siri to tell users they are not good people.

Apple has an additional challenge in their overall approach: their privacy-focused stance has seen more and more of their ML tasks performed on-device, and Apple’s long-term investment in the dedicated Neural Engine core (first introduced in 2017) has demonstrated that their strategy is very much focused on doing as much as possible without leaving the device. This results in some limitations in both the size and quality of the model that underpins Siri – what ChatGPT achieves running in a Microsoft data centre, Siri needs to do within the phone.

The slightly lacklustre voice assistant isn’t Siri though. Voice is simply one of the interfaces that allows users to interact with Siri. Look for Siri elsewhere and you will start to see that Apple considers Siri to be almost everything that is powered by AI.

When widgets were introduced in iOS 14, Apple included one particular widget which I think hints at the direction of Apple’s eventual longer-term AI integration: Siri Suggestions.

The widget is actually two types of widget: a curated selection of 8 apps that change based on context and location, and a set of suggested actions based on what Siri anticipates you will want to do within apps, again based on your context. Whilst I think both are brilliant, and I use both types on my own home screen, it is the second that I think gives the best indication of where Apple’s AI strategy is heading.

Screenshots of Siri Suggestions

Apple provides the ability for apps to expose “Activities” to the wider operating system. Whether that is sending a message to a specific friend, booking a taxi or playing the next episode of a show, each activity is available to the widget without needing to go into the app to find it.

Within the widget, Siri then presents what it thinks are the most relevant activities for a given time and place. Arrive at a restaurant at 8pm and don’t look at your phone for two hours? Don’t be surprised if the top suggestion when you do is to book a taxi back home. Usually call your parents at 7pm on a Sunday? Expect a prompt to call them to appear. By combining the contextual clues the operating system has, its data on your historical patterns and the activities available to it, Siri is uniquely placed to accurately predict what you want to do.
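
To be clear about the shape of the problem (and not about how Siri actually implements it), a toy version of that ranking might look like this, with invented actions and scores:

```python
# A toy sketch of context-based action ranking: score the actions apps
# have exposed against the current time, place and past behaviour. This
# is only the shape of the problem, not how Siri Suggestions actually works.
ACTIONS = [
    {"name": "Book taxi home",    "place": "restaurant", "hour": 22, "past_uses": 14},
    {"name": "Call parents",      "place": "home",       "hour": 19, "past_uses": 40},
    {"name": "Play next episode", "place": "home",       "hour": 21, "past_uses": 25},
]


def score(action: dict, context: dict) -> float:
    place_match = 2.0 if action["place"] == context["place"] else 0.0
    time_match = max(0.0, 1.0 - abs(action["hour"] - context["hour"]) / 3)
    habit = action["past_uses"] / 50
    return place_match + time_match + habit


def suggest(context: dict, n: int = 2) -> list[str]:
    ranked = sorted(ACTIONS, key=lambda a: score(a, context), reverse=True)
    return [a["name"] for a in ranked[:n]]


# Sunday evening at home: calling your parents should float to the top.
print(suggest({"place": "home", "hour": 19}))
```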

The most notable element is that the focus here is on actions using apps rather than on the app itself. This returns us to the primary driver of good user experiences; helping the user to achieve their desired outcome in the easiest possible way.

Actions, Activities and App Clips

Given that many people have spent the past decade using a smartphone, it is not uncommon to have hundreds of apps installed, most used extremely rarely. I, for some reason, still have an app for Parc Asterix installed despite last visiting for just one day nearly 4 years ago.

We’re moving away from the days of “There’s an app for that” and into the days of “Why do I have to download an app to do that?”. Apple’s solution, introduced in 2020, is App Clips.

App Clips are a way for developers to provide access to features of their app to a user without having them download the full app. They’re often contextual – a restaurant could provide access to their menu, ordering and payment via an App Clip triggered by a code or NFC tag on a table. In Apple’s words: “App Clips can elevate quick and focused experiences for specific tasks, the moment your customer needs them.”

Whilst I’ve rarely seen App Clips used in the wild, I sense that this is another example of Apple moving pieces into place as part of their future strategy.

Fewer Apps, more Actions

By encouraging developers to provide access to specific actions or features within an installed app or via an App Clip, Apple has created an environment for their devices in which Siri can provide users with the correct action based on context, potentially without users even needing to have the app installed.

As Siri’s features become more powerful, I predict that Apple will start to leverage the actions more and more, potentially even handing off the interaction entirely to Siri.

Concept screenshots of how Parc Asterix might exist if not as an app

Take the Parc Asterix app for example – my ideal user experience is that my phone knows when I’ve arrived, checks my emails for any tickets that I already have and presents the option to buy them natively (no downloading apps) when I don’t. When I’m inside the park, I want it to provide me with easy access to an App Clip which contains a map and ride waiting times. But then I want to be able to leave and not have yet another app that won’t be used for years.

Apple’s headstart

It’s easy to point at Siri’s chat functionality and suggest that Apple is falling behind, but I think the reality is quite different. Apple has spent almost a decade building AI tools that seamlessly integrate with the operating system. They have the greatest level of access to information possible because, for most people, our lives live within our phones. I want to see Apple leveraging that and integrating AI throughout the OS to work for me and make my life that much easier.

Where the likes of rabbit have been working on Large Action Models for the past year, Apple has been at it for a decade.

I do hope that Siri’s chat functionality gets a lift this year, but I don’t think that should be the focus; I want a device that anticipates me and understands what I need to make my life easier. Apple, more than anyone, is able to deliver on that.

How hard is being an assistant anyway?

A year ago, Ben Thompson made it clear that he considers AI to be an entirely new epoch in technology. One of the coolest things about new epochs is that people try out new ideas without looking silly. No one knows exactly what the new paradigm is going to look like, so everything is fair game.

The devices we carry every day have pretty much not changed for 15 years now. Ask a tweenager what a phone looks like and in their mind, it has likely always been a flat slab of plastic, metal and glass.

There are attempts to bring new AI-powered devices that are accessories to phones to the fore – I wrote about the Ray-Ban Meta glasses last year – but it has taken until the past couple of months to see devices emerge that are clearly only possible within this new era.

Enter the Humane Ai Pin and the rabbit r1.

An illustration of the rabbit r1 and the humane Ai Pin devices

Both are standalone devices. Both have a camera and relatively few buttons. Both sport very basic screens (one a laser projector!⚡️) for displaying information to the user rather than for interaction. And both are controlled, primarily, by the user’s voice interactions with an AI assistant.

In theory, these are devices that you can ask to perform just about any task and they’ll just figure out how to get it done.

As a user, this sounds like the ideal scenario, the ultimate user experience. Issue a simple request and have it fulfilled without further interaction. The dream, like an ever-present perfect human assistant.

The Perfect Assistant

But let’s put aside the technology for a moment and figure out what we would expect of that perfect human assistant.

For the sake of this thought experiment they are invisible, always there and entirely trustworthy. Because we trust them, we would give them access to absolutely everything; email, calendar, messages, bank accounts… nothing is off limits. Why? Because they will be more effective if they have all of the same information and tools that we do.

So, with all of that at their disposal, the assistant should be able to solve tasks with the context of the rest of my life and their experience of previous tasks to draw on.

Simple Tasks That Are Actually Quite Complicated

We’ll start with an easy task: “Book me a taxi to my next meeting.”

The assistant knows where you currently are, so they know where to arrange the pickup from. And they have access to your calendar, so they know where the next meeting is, and what time it is. They can also look on Google Maps to check the traffic and make sure that you’ll be there on time. They know that you have an account with FreeNow, and prefer to take black cabs when you’re travelling for work. And so, when you ask them to book a taxi, they can do so relatively easily, and you will get exactly what you need.

A graphic detailing the parts of the simple task of booking a taxi

Exclude one of those pieces of information though, and you will not necessarily end up with the desired result. And that’s for a relatively straightforward request.
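
Here is a small sketch of that fragility, with invented context data: each subtask draws on a different source, and removing any one piece means the assistant has to stop and ask.

```python
# A sketch of why "book me a taxi to my next meeting" depends on every
# piece of context being present: each subtask pulls from a different
# source, and a single gap breaks the chain. All data here is invented.
CONTEXT = {
    "current_location": "Office, 1 Poultry",
    "next_meeting": {"place": "Canary Wharf", "time": "15:00"},
    "taxi_account": "FreeNow",
    "work_preference": "black cab",
}

REQUIRED = ["current_location", "next_meeting", "taxi_account", "work_preference"]


def book_taxi(context: dict) -> str:
    missing = [key for key in REQUIRED if not context.get(key)]
    if missing:
        # Without, say, calendar access the assistant has to come back and ask.
        return f"Cannot book yet - missing: {', '.join(missing)}"
    meeting = context["next_meeting"]
    return (f"Booked a {context['work_preference']} via {context['taxi_account']} "
            f"from {context['current_location']} to {meeting['place']} for {meeting['time']}")


print(book_taxi(CONTEXT))
print(book_taxi({**CONTEXT, "next_meeting": None}))  # no calendar access
```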

When you make the request more complex, the level of information and the variety of subtasks required becomes huge. “I need to meet with Customer X at their office next Wednesday”. If you’re in New York and your customer is in Austin, TX, there are flights to be arranged, hotels to be booked and transfers in between, not to mention the diary management that occurs.

A graphic detailing the parts of the complex task of booking a trip

These are also pretty normal requests of an assistant - things that happen the world over every day but which are, when broken down, incredibly complex and made up of various interdependent subtasks. Each of them is important, and if one of the subtasks fails, the entire task can go wrong.

The perfect assistant though, would be able to handle all of this without breaking a sweat, and we would rely on them because we trust in one of the hardest human traits to replicate: their judgement.

Enter the AI-ssistant

The implied promise in the rabbit r1 keynote is that you will be able to say “I want to go to London around 30th January - 7th February, two adults and one child, with non-stop flights and a cool SUV” and it will be able to plan, arrange and book the entire trip for you.

This, if it is true, is remarkable, precisely because of how complex and interlinked tasks actually are. If we remove the professional elements from the above example, the sub-tasks involved in booking a trip like this and the understanding required are still huge.

I think the r1 is a cool concept, but the hand-wavey elements of the keynote (“Confirm, confirm, confirm and it’s booked”) are alarming, precisely because those are the actually hard parts. Getting ChatGPT to spit out a travel itinerary is easy, but actually having an AI that is able to follow through and execute properly on all of the required tasks is another matter.

Don’t misunderstand me, I fully believe that an AI could navigate a webpage and follow the process to select and pay for things. I can see in the keynote that the r1 has the ability to connect to Expedia and would bet that it can book a hotel on the site.

My quandary is that when I, as an actual human™, go onto Expedia to book the above trip, I’m presented with over 300 options just for the hotels. At the top hotel, there are 7 different room types with only a couple of hundred dollars in cost difference for the entire stay between the largest and the smallest. This is already complicated before I throw in personal taste.

Then, once you throw in flights – where the option that I’d likely choose based on my personal time preferences is actually 9th on the price-ranked list (still within $20 of the cheapest option) – I just don’t see how the r1 is ever going to give me what I actually want. I know what that is, and I know that a human assistant who has gotten to know my preferences and proclivities would likely make the same choices as I would, but that’s because we both have that very human trait of personal judgement.

I can see how an AI that has had access to all of my past bookings may be able to detect patterns and preferences, but I also can’t see any evidence that the r1 has that access, or the ability to learn about me personally. I won’t comment on the Humane Ai Pin, but I can’t see much evidence of that there either.

My feeling is that a “just” good assistant, one that is simply able to follow your directions and get stuff done, is actually quite hard to replicate. Combine that with the traits of a great assistant – one that can anticipate your needs, exercise good judgement and potentially even solve problems before you ask – and we’re at another level of complexity.

It’s not that I’m bearish on AI assistants as a whole but I do think that the role of being an assistant is much more complex and personal than people imagine. I can’t wait to see where we end up with daily AIs that we interact with but I can’t help but feel that an assistant in this manner just isn’t it. Yes, help me sort through the cruft of those 300 hotels, but I don’t think I’ll trust an AI to make the call for me in the same way as I would a human, at least not anytime soon.

Sebastifact: A fact machine for 7 year olds

My son is 7, an age where he is becoming interested in what I do at work. He ‘gets’ the idea of apps and websites, but I wanted to put together a very simple project that we could build together so that he could see how to take an idea and turn it into a real “thing”.

We brainstormed some ideas – he loves writing lists of facts and finding pictures to go with them, with the ambition of building an encyclopaedia – so we started work on a simple website into which he could type the name of a historical person and have it return a set of 10 facts.

His design goal was pretty simple - the website should be yellow. I decided that it was probably best to focus on the functionality, so yellow it is.

As is usually the case, the backend is where the action is. I wasn’t sure how to explain to him just how complex this site would have been to make just two years ago - the idea of entering almost any historical figure into a website and having it simply provide 10 facts back would have required a hugely laborious process with many years of contributions. And yet, here we are in 2023, and the solution is “just plug a Large Language Model into it”. This results in a pretty easy introductory project. We talked a little about how an API works and how computers can talk to each other and give each other instructions, and then set to work writing the prompt that we wanted to use.

As we tested it, he started to ask if it would work for animals too. And then mythical beasts. And then countries. Seeing him working through the ideas and realising that he could widen the scope was great. This is the eventual prompt we settled on:

You are an assistant to a family. Please provide responses in HTML. The User will provide a Historical Person, Country, Animal or Mythical Beast. Please provide 10 facts appropriate for a 7 year old. If the user provided name is not a Historical Person who is dead, a Country, an Animal or Mythical Beast please respond with an error message and do not respond with facts.

By asking the LLM to provide responses in HTML we offloaded the task of formatting the output, and GPT-3.5 Turbo is pretty good at providing actual HTML - I haven’t seen any issues with it yet. By instructing it to make sure the facts are appropriate for a 7-year-old, the tone of the facts changed and we got facts that were (surprise) actually interesting for him without being too pointed in their accuracy.

The response takes a few seconds to come back so I implemented caching on the requests - the most popular searches appear instantly. Ideally in the future, I’ll give all of the results URLs.
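
For anyone curious, the backend really is about as simple as it sounds. This is a simplified sketch of its shape – the system prompt from above, a call to GPT-3.5 Turbo and a naive in-memory cache – rather than the site’s exact code:

```python
# A simplified sketch of the backend: the system prompt from above, a call
# to GPT-3.5 Turbo, and a naive in-memory cache so popular searches come
# back instantly. The real site differs in the details, but not by much.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are an assistant to a family. Please provide responses in HTML. "
    "The User will provide a Historical Person, Country, Animal or Mythical Beast. "
    "Please provide 10 facts appropriate for a 7 year old. If the user provided name "
    "is not a Historical Person who is dead, a Country, an Animal or Mythical Beast "
    "please respond with an error message and do not respond with facts."
)

_cache: dict[str, str] = {}


def get_facts(subject: str) -> str:
    key = subject.strip().lower()
    if key not in _cache:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": subject},
            ],
        )
        _cache[key] = response.choices[0].message.content  # HTML, ready to render
    return _cache[key]


print(get_facts("Plato"))
```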

As a final bonus, I plugged in the Unsplash API to return images for him. It doesn’t always work (Unsplash apparently has relatively few pictures of Plato) but for most searches, it provides a suitable image. I might consider changing that to use the DALL-E API, but I think for now this is good enough.

There were two takeaways from this. The first: working with a 7-year-old is a test in scope creep. I wanted to keep this to an afternoon activity so that it would keep his interest but of course, it could have been a much larger site if we had incorporated all of the ideas. Giving him something that he had actually built was the important goal for me, and keeping the scope to something achievable whilst also feeling like something that is his own was the most important thing.

The second is something that I hammer on about all the time: an LLM is a massive toolbox that can help users achieve almost anything, but there is great value in providing a User Interface that allows a user to achieve a very specific task. There is a good reason why there are so many kitchen gadgets that can basically be replaced with a single knife - the user experience of using a dedicated tool that requires less skill is better.

The site is at https://www.sebastifact.com

Ray-Ban Meta Glasses: Truly Wearable AI?

I’ve been excited to get my hands on the new Ray-Ban Meta Glasses, and picked up a pair yesterday.

An illustration of a Robot wearing Ray Ban Meta Sunglasses
Me, wearing my new glasses

The most intriguing aspect of the glasses for me is the prospect of mixed-mode AI without taking my phone out of my pocket. Meta won’t release this until probably next year, but I do have some observations on how we could get there slightly sooner.

OpenAI released their multi-modal version of ChatGPT about a month ago, which means that you can now speak to ChatGPT (an oddly stilted style of conversation which is still quite compelling – I wrote about it here) and send it images which it can interpret and tell you about.

One of the cool features that OpenAI included in the voice chat version is that on iOS the conversation is treated as a “Live Activity” – that means that you can continue the conversation whilst the phone is locked or you are using other apps.

What this also means is that the Ray-Ban Metas do have an AI that you can talk to, in as much as any Bluetooth headphones connected to an iPhone can be used to talk to the ChatGPT bot whilst your phone is in your pocket. I’ve looked at options to have this trigger via an automation and shortcut when the glasses connect to my phone but ultimately don’t think that is very useful - I don’t want an AI listening all the time; I want to be able to trigger it when I want it. It did lead me to add an “Ask AI” shortcut to my home screen which immediately starts a voice conversation with ChatGPT, which I suppose will help me to understand over time how useful a voice assistant actually is. I also had high hopes that using “Hey Siri” would be able to trigger the shortcut, which it can, but not when the phone is locked. So close and yet so far.

As I said above though, this feature is also something that all headphones can be used for. The grail, and ultimate reason for getting the Ray-Bans, is in letting the AI see what you can see. Given that this feature won’t be officially released until probably next year, what options do we have?

The solution may come in the form of another Meta product, WhatsApp. I built a simple WhatsApp bot earlier this year which allows me to conduct an ongoing conversation with the API version of GPT-4; it’s quite rudimentary but does the job. The cool thing about the decision to deeply integrate the Meta glasses with other Meta products is that you can send messages and photos on WhatsApp directly from the glasses without opening your phone. The glasses will also read out incoming messages to you. This works pretty well with the bot that I’ve built; I can send messages using the glasses and they will read back the responses. I can say to the glasses “Hey Meta, send a photo to MyBot on WhatsApp” and it will take a snap and send it straight away. The GPT-4V(ision) API hasn’t been released yet, but once it has been, I’ll be able to send the pictures to the bot via WhatsApp and the glasses will be able to read back the response.
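
For the curious, the bot itself is nothing clever. Here is a rough sketch of its shape – a webhook that relays incoming messages to the GPT-4 API and sends the reply back – with the payload fields and the send helper left as placeholders, since every WhatsApp provider has its own format:

```python
# A rough sketch of the bot's shape: a webhook receives incoming WhatsApp
# messages, forwards the text to the GPT-4 API with the running history,
# and sends the reply back. The payload fields and send_whatsapp_message
# helper are placeholders - every provider (Meta Cloud API, Twilio, etc.)
# has its own webhook format and send call.
from flask import Flask, request
from openai import OpenAI

app = Flask(__name__)
client = OpenAI()
history: list[dict] = []


def send_whatsapp_message(to: str, text: str) -> None:
    """Placeholder: call your WhatsApp provider's send endpoint here."""
    print(f"-> {to}: {text}")


@app.post("/webhook")
def incoming():
    payload = request.get_json()
    sender, text = payload["from"], payload["text"]  # placeholder field names

    history.append({"role": "user", "content": text})
    reply = client.chat.completions.create(model="gpt-4", messages=history)
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})

    send_whatsapp_message(sender, answer)
    return "", 200
```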

This all feels pretty convoluted though and is ultimately an attempt to hack my way around the lack of available functionality. The Meta Glasses are quite cool but they aren’t wearable AI. Yet.

As with many things within the space at the moment, the technology feels tantalisingly close but not quite there. The last time anything felt quite like this to me though was the dawn of the smartphone era. Playing with the glasses has made me oddly reminiscent of playing with the accelerometer in the Nokia N95… if we’re at that point with AI then the iPhone moment is just around the corner.

Short thoughts on chatting with an AI

OpenAI released multi-modal AI a couple of weeks ago and it has been slowly making its way into the ChatGPT app. It is quite disconcertingly brilliant.

An illustration of a Robot and a Human having a conversation.
An accurate depiction of the process

Conversation is a funny thing. Reading podcast transcripts can be quite nightmarish - we don’t realise how much the spoken word, especially during conversations, meanders and is peppered with hesitation, deviation and repetition until we see it written down. When we’re speaking though, these unnecessary additions make conversation human and enjoyable. When we’re listening, we often don’t realise how much we are actually playing an active role in doing so–in person, it’s the facial expressions and nods which encourage the speaker to continue; on the phone, the short acknowledgements that let a partner know you’re still there and listening. I often speak to a friend on the phone who mutes when they’re not speaking and the experience is fine, but the silence is slightly off-putting.

And so to the experience of chatting with an AI. It’s brilliant, in as much as it actually feels as though you are having something of a conversation. The responses aren’t the same as the ones you would receive by directly typing the same words into ChatGPT - they’ve clearly thought about the fact that spoken conversation is different. There is surprisingly little lag in the response. You don’t say your piece and then wait for 10 seconds for it to process; the AI responds in a couple of seconds, almost straight away once it’s heard a long enough pause. The quality of the AI is fantastic - it’s using GPT-4 which is about as state-of-the-art as it can get, and the voices, whilst not human, are surprisingly great.

However.

The entire experience is disconcerting because of how precise it is. There is no room for you to take long pauses while you think mid-sentence, or rephrase as you talk. There is absolute silence when you are talking which causes you to look down at the screen to make sure it’s still working. The responses are often long and apparently deeply thought through, but they often end with a question, rather than just being an open-ended response to work from. I’m looking forward to having an AI conversational partner, but I want it to help me tease out ideas, not necessarily give me fully formed AI thoughts on a subject. I want it to say “yes” whilst I’m speaking for no apparent reason other than to encourage me to keep talking through the idea. I want it to meander and bring in new unrelated but tangential ideas. Ultimately, I guess I want it to be a little more human.

OK Computer: 21st Century Sounds

Musical Discovery

For much of the latter half of the 20th century, new music discovery went something like this. An artist would make a song and they’d send demo tapes out to record companies and radio stations. They’d play to dimly lit bars and clubs, hoping that an A&R impresario lurked in the crowd. If they were lucky, a DJ might listen to their demo and would play it live. Perhaps someone would record it and start bootlegging tapes. These contraband tapes would be passed around and listened to by teenagers gathered in bedrooms. If all went well, the artist’s popularity would grow. They’d be signed, played more on the radio and do bigger shows. Fans and soon-to-be fans would go to record stores to listen to the new releases and buy the music on vinyl, tape or CD. The record shops would make money, the musicians would make money, the record companies would make more money.

An illustration of a Robot DJing.
R2-DJ

This began to shift in the early noughties, driven, as so much was, by the emergence of the internet. The newfound ability to rip CDs and transform tracks into easily shareable mp3s on the likes of Napster rendered the entire world’s music repertoire available gratis to eager ears. For those that preferred their music to come without the lawbreaking, the iTunes store and others made purchasing it just about as easy. MP3 players and the iPod made it effortless to carry 1000 songs in your pocket. The days of physically owning your music were all but over in the space of only a few short years.

Despite the music industry’s hope that killing Napster would stem the rising tide, the death of the platform only resulted in more alternatives appearing. It turned out that people liked having instant access to all music for pretty much free. Music discovery underwent a transformation. To acquire a song, one simply had to search for it, and within minutes, it was yours—provided Kazaa or uTorrent were operational and your parents didn’t pick up the phone and break the connection. Online forums teemed with enthusiasts discussing new musical revelations and leaks, offering nearly everything and anything you desired, all for free.

Music was no longer scarce; there were effectively infinite copies of every single song in the world to which anyone could have immediate access. Gone were the days of friends passing around tapes or lingering in record stores. The social aspect of music discovery shifted from smoky bars, intimate bedrooms, and record emporiums to the virtual amphitheaters of online forums, Facebook threads, and text message exchanges.

The big problem with all of this, of course, was that it was all quite illegal.

In 2006, Daniel Ek was working at uTorrent. He could see that users wouldn’t stop pirating music and that the music industry’s attempts to thwart sharing were doomed to failure. He “realized that you can never legislate away from piracy. Laws can definitely help, but it doesn’t take away the problem. The only way to solve the problem was to create a service that was better than piracy and at the same time compensates the music industry – that gave us Spotify.”

Musical Curation

Spotify launched with a simple headline: “A world of music. Instant, simple and free.”

By 2023 it had over 500 million users.

For many music fans, playlists took center stage, with enthusiasts investing hours trawling the internet for them. Spotify introduced a feature that let you see what your friends were listening to via Facebook, and then let you share playlists directly with others. Then they made playlists searchable. That killed off sites like sharemyplaylist, but it meant that when I needed three hours of Persian wedding songs, all I had to do was hit the search bar to appear intimately familiar with the works of Hayedeh and Dariush in front of my then soon-to-be in-laws.

In 2015 Spotify launched Discover Weekly, a dynamic playlist that introduced users to tracks similar to what they had played recently. It was remarkably good. The social aspect of music discovery was being lost, but it was replaced by an automaton that did the job exceptionally well, even if the results were sometimes corrupted by the false signal of repeated plays of Baby Shark.

What was more subtle about this period was that the way people consumed music was changing. We had progressed from music discovery as a purposeful act to music discovery as an everyday occurrence. Background music had always existed, but it came via the radio or compilations; this was personal. The value of the music itself transformed. The ability to have a consistent soundtrack playing at home, at work, in the car or as you made your way through everyday life meant that listeners weren’t necessarily concerned about the specific artists playing; they had become more interested in the general ambience of that ever-present background music. Listeners still relished the release of the new Taylor Swift album, but they also listened more readily to music they didn’t know, without ever asking who the artist was, simply because it fit within the soundtrack of their lives.

Discover Weekly was one of Spotify’s first public forays into personalised music curation using machine learning, and people loved it. Its success led to more experiments.

Spotify in 2023 is remarkable. When I want to run to rock music, the tempo and enthusiasm of the suggested playlist is exactly right. When I want to focus on work, the beats are mellow and unobtrusive. The playlists “picked for me” change daily, powered by AI. I still create my own playlists, but the experience is now akin to using ChatGPT: I add a few songs to set a mood and Spotify offers up suggestions that match the vibe. Prior to a recent trip, I created a playlist called “Italy Background Music”, which Spotify duly filled with tracks I wouldn’t have had the first idea where to find. They were exactly what I was looking for.
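
To make the idea of “vibe matching” concrete, here is a toy sketch of how a playlist could be extended from a few seed songs using nothing more than audio-feature vectors and cosine similarity. This is emphatically not Spotify’s actual system; the catalogue, feature values and function names below are invented purely for illustration.

```python
# Toy "seed a few songs, get vibe-matched suggestions" sketch.
# Illustrative only: the catalogue and features are made up.
import numpy as np

# Hypothetical catalogue: track name -> feature vector (tempo, energy, valence)
catalogue = {
    "Track A": np.array([0.62, 0.40, 0.55]),
    "Track B": np.array([0.90, 0.85, 0.70]),
    "Track C": np.array([0.60, 0.35, 0.50]),
    "Track D": np.array([0.30, 0.20, 0.80]),
}

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def suggest(seed_tracks, n=2):
    """Average the seed vectors into a 'vibe' and return the closest non-seed tracks."""
    vibe = np.mean([catalogue[t] for t in seed_tracks], axis=0)
    candidates = [t for t in catalogue if t not in seed_tracks]
    return sorted(candidates, key=lambda t: cosine(vibe, catalogue[t]), reverse=True)[:n]

print(suggest(["Track A"]))  # tracks closest to the seed's vibe
```

The real systems layer collaborative filtering, listening history and much more on top, but the core intuition, average the seeds and find the nearest neighbours, is similar.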

Curation and general discovery, it seems, have been broadly solved by Spotify.

Musical Creation

I’ve become accustomed to hearing tracks that I’ve never heard before, by artists I couldn’t begin to name. Occasionally, I’ve tapped through on an unknown song and discovered that it has only ever had a couple of hundred thousand plays. Spotify is clearly drawing on the entire breadth of artists within its library to match my musical preferences. Or is it?

In 2016, Music Business Worldwide published an article stating:

Spotify is starting to make its own records.

Multiple cast-iron sources have informed us that, in recent months, Daniel Ek’s company has been paying producers to create tracks within specific musical guidelines.

By introducing its own music into the (literal) mix, tracks it has paid a producer a flat fee for, added to the platform under a false name and then surfaced to listeners via its AI-curated playlists, Spotify solves two problems it considers important. From a user’s perspective, more music that fits their desired soundscape is a good thing. From Spotify’s perspective, slipping a three-minute track of in-house music (on which it pays no royalties) into every hour of listening cuts its royalty cost for that hour by 5%, since three minutes is one twentieth of the hour. The losers are the artists, who would otherwise have earned from those three minutes of play.

There is nothing to say Spotify can’t do this, though; it’s their platform, and even when big artists have removed their music in protest at the company’s policies or actions, they’ve quietly reappeared within months, such is the value of the service.¹

In the 2016 article above, it is clear that the firm was paying actual producers to make the music. In 2023, the landscape is likely quite different. AI has advanced to the point where the beats, melodies and riffs of jazz, trip hop and other non-vocal music can be produced quite easily by a well-trained model. There are dozens of sites that algorithmically generate lo-fi background music. If Spotify isn’t already adding AI-generated tracks that perfectly match a given vibe, especially within those non-vocal genres, then it is at least experimenting with doing so. The prize is too large not to. In 2021, the company paid more than $7bn to rights holders. At 5%, that’s a nice $350m to find down the back of the AI sofa.

Licensing

Where this leads, in my mind, is somewhere entirely new.

Whilst vocal-less (can we really call it instrumental?) music is the easiest use of the technology, earlier this year we saw a brief explosion of AI-generated tracks from creators using AI voice models that imitated the likes of Drake and Kanye. Whilst these tracks weren’t perfect, they offered an early preview of technology that will change the face of music. The Hugging Face community is full of models of popular artists that can replicate the sound of a given singer or rapper, and improvements are moving at a rapid clip, with some now indistinguishable from the original artist.

Licensing of brands exists broadly in most other industries. In fashion it saved Ralph Lauren (although it nearly killed Burberry). It famously turned Nike from pure sportswear into a casual-fashion mega-brand. Could we see the emergence of the artist as a brand? Artists could license their musical likeness directly to a given platform, or allow producers to use an authorised AI model of their voice to create tracks that they, or their team, sign off on; either route could drastically extend a vocalist’s reach.

Whilst the last idea might sound fanciful, there are artists who already draw on the online community. One of the DJ/producer Deadmau5’s biggest tracks came about when a fan sent him, via Twitter, a demo vocal mix for a track he had produced in a livestream the previous day.

We’ve also seen a rash of artists selling their music rights – will the future see artists who reach the end of their careers sell their “official” AI model, allowing them and their families to earn in perpetuity? It’s been proven repeatedly that the artists who adapt to a changing world are the ones that succeed, but this is something entirely new.

What seems certain, however, is that the music we listen to in the coming years will be picked for us by machines and at least partially created using AI.

Footnotes

  1. There is a good argument that Taylor’s reasoning for removing her music wasn’t entirely to do with this.

It’s about to get very noisy

The internet is about to get a lot noisier, thanks to the rise of Large Language Models (LLMs) like GPT.

[Illustration: robot monkeys typing on typewriters]

The Creators of The Internet

40 million people sounds like a lot of people. It’s roughly the population of Canada. And in 1995, it was the population of the entire internet. But today, the population of the internet is about 5.2 billion. That is quite a lot more people. Most of those people are consumers. They use the internet without adding anything to it; this is fine, it is good in fact. They watch Netflix, chat on WhatsApp and scroll on TikTok. They might tweet on Twitter but, since the great Elon-ification of that place, it appears that fewer of them are even doing that. These people used to be called ‘lurkers’ on forums. What they aren’t doing is creating websites, or trying to get you to buy something, or writing blogs like this one. For our purposes, we can say that these people are being very quiet indeed.

But then there is a much smaller group of people who, like me, see the internet as a place where they can make a bit of noise and exercise some form of creativity, be it running a business, creating reels for consumers to scroll or cobbling their thoughts into a 1,000-word article. This is also a good thing - consumers can’t consume if no one is creating stuff for them to consume.

Signal vs Noise

Most of the stuff being created is pretty average; it’s just background noise, a light hum. A few people might consume it, but it’s receiving 27 likes on Instagram and isn’t really getting in anyone’s way. It’s the background music that plays in shops and hotel lobbies to fill the uncomfortable silence: always there, quite pleasant, and you’d probably miss it if it wasn’t.

Then we have the good stuff - these are the pieces that go viral: the think piece that nails a topic so absolutely, the TikTok that gets a bajillion views, the hot take that turns out to be absolute fire, the Stack Overflow answer that comes like an angel of the night to salve your programming pain. This is the sweet melody of signal. We like signal, and we spend most of our consuming time wading through the mire of noise looking for it. Sure, the background noise is a bit irritating, but it’s relatively easy to find the good stuff once you know where and how to look for it. Google is pretty good at finding what we want. We’ve built aggregators like Reddit or Hacker News, where people can come and say “Hey everyone! Look, I found some gold!” and other people can upvote it if it’s actually gold, or downvote it to oblivion if it’s not.

All of this seems sort of fine. Every year more noise is created, but so too is signal. The good thing was that an individual creator could only create so much noise. Automated content was pretty obviously automated content, and even if Google didn’t manage to filter it out for you first, once you started reading it, it was pretty clear that no human had set eyes on it during the creation process. We became quickly attuned to that particular lack of harmony and clicked the back button, which also signalled to Google that the content wasn’t actually helpful, meaning the next person searching for that particular query would be less likely to end up seeing that page.

The problem we’re facing now, though, is that creators have been given a whole new kind of instrument to play. It’s not that this instrument is louder or more dominant than the others; it’s that it can create an almost infinite number of songs with barely any human input at all. And they all sound pretty great. LLMs are really good at creating noise. It’s not just that they can create an ungodly amount of content (basically for free); it’s that differentiating that content from human-generated content is, by design, almost impossible. Where a content creator could once have put out a few decent-quality articles a day, they can now put out thousands. The noise-to-signal ratio is about to shift dramatically.
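
To see why the economics tilt so sharply, here is a minimal sketch of what mass production looks like with an LLM in the loop. It uses the pre-1.0 openai Python client; the model name, prompt and topic list are illustrative assumptions, not a recipe.

```python
# Minimal sketch: one loop, one prompt template, one LLM endpoint.
# Illustrates why the marginal cost of "content" collapses towards zero.
import openai

openai.api_key = "sk-..."  # placeholder key

topics = ["standing desks", "air fryers", "budget travel in Portugal"]

for topic in topics:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Write a 700-word blog post about {topic}.",
        }],
    )
    draft = response["choices"][0]["message"]["content"]
    # A creator could run this over thousands of topics a day.
    print(f"--- {topic} ---\n{draft[:200]}...")
```
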

Webinars and the Evolution of Playbox

In early 2020, as the world adjusted to a new normal, I found myself inundated with invitations to webinars. Despite their promise of convenience and flexibility, I couldn’t shake the image of low-ceilinged conference rooms at a Holiday Inn, time-share salesmen, and the lingering smell of stale coffee. My aversion to webinars, though, seemed irrational. After all, webinars could be attended from the comfort of home, with self-brewed coffee, and the ever-tempting option to click the X and leave whenever they morphed into disguised sales pitches. No accusatory stares included.

Yet, as the pandemic stretched on, attending webinars became a daily ritual. The more I participated, the clearer it became that my frustration stemmed not just from the content but from the format itself. Companies would often plan an hour-long session, with 45 minutes dedicated to presentations and a mere 15 minutes for questions at the end. The structure seemed arbitrary, with many hosts struggling to fill the time meaningfully. The low information density and the “questions at the end” format meant I often lost track of my queries or interest by the time my turn came.

Determined to find a solution, I envisioned a more efficient webinar experience. Assuming the content was relevant, my ideal setup included:

  • A searchable transcript, complete with images of key slides
  • Bullet points highlighting the top takeaways
  • The ability to watch at 1.5x speed or higher
  • An option to listen to the webinar as a podcast

Achieving this, however, posed a significant challenge for production teams, requiring considerable effort to create transcripts, summarize content, and upload podcasts. Enter GPT-3. With its release, I saw an opportunity to transform how webinars were consumed.

This marked the birth of Playbox (née Foyer), driven by my vision that any input should become any output, allowing users to consume content in their preferred format.

We utilized rev.ai to achieve 98%-accurate transcripts, developed a sophisticated transcript editor that linked each word to its video timestamp, and allowed users to insert video frames to display slides. GPT-3 helped generate summaries and key takeaways, enabling easy creation of social media snippets, videos and podcast-ready audio tracks with chapters.
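
For a sense of what the core of that pipeline looked like, here is a heavily simplified sketch: transcript in, takeaways out. The get_transcript helper below is a hypothetical stand-in for the rev.ai transcription step (the real SDK calls and our actual prompts aren’t reproduced here), and the summary step uses the GPT-3-era Completion endpoint from the pre-1.0 openai client with an illustrative model name and prompt.

```python
# Simplified Playbox-style pipeline sketch: transcribe, then transform.
import openai

openai.api_key = "sk-..."  # placeholder key

def get_transcript(webinar_url):
    """Hypothetical stand-in for the rev.ai transcription step."""
    raise NotImplementedError("call your transcription provider here")

def summarise(transcript):
    """Ask a GPT-3-era completion model for bullet-point takeaways."""
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=(
            "Summarise the following webinar transcript as five bullet-point "
            f"takeaways:\n\n{transcript[:6000]}"
        ),
        max_tokens=400,
    )
    return response["choices"][0]["text"].strip()

# transcript = get_transcript("https://example.com/webinar.mp4")
# print(summarise(transcript))
```

The rest of the product (the timestamp-linked editor, the slide frames, the podcast export) was built on top of that same transcript.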

However, understanding the motivations behind companies hosting webinars proved complex. Typically part of a broader marketing strategy, webinars often aimed to gather potential customers’ details. While we addressed this by allowing content to be gated behind email collection forms, the core issue was that many companies used webinars not just to inform but to engage potential customers after the event. By enabling users to skip the live experience, Playbox was seen as disrupting this lead-generation flow, despite evidence that our approach led to a gradual increase in leads over time.

In hindsight, the timing for Playbox was off. During a period when webinars were a primary touchpoint, offering a tool that encouraged asynchronous interaction conflicted with companies’ immediate needs. Yet, I remain convinced that a consumer-centric approach to content consumption is the future. Companies that embrace this shift, making their content accessible in all formats, will ultimately gain more control and foster deeper connections with their audience.

While Playbox’s initial reception highlighted the challenges of disrupting entrenched practices, the principles behind it hold promise. As technology and consumer preferences evolve, the idea of flexible, on-demand content consumption, in whatever format the audience prefers, will become increasingly relevant. Companies that adapt to this new landscape will not only survive but thrive, meeting their audience on their terms and fostering lasting engagement.

Welcome to Technical Chops

One of the most exciting elements of working within the tech industry is that there is always a lot of ✨new✨.

New ideas, new products, new words, new technologies. New things are good. New things are shiny. But which new things are going to change the world? Which technologies are going to fundamentally change how we interact with the world? How do we differentiate between hype and practical utility? And what does this all mean for the vast majority of the world?

Whether you are a non-technical person or even a technical one, it is increasingly difficult to differentiate between the simply new and shiny and the new, shiny and potentially world-changing. I’m looking forward to exploring how new technologies could change the world, and what it means for our next rotation.

Over the past few years I’ve been paying particular attention to:

  • Open Source Software and its wider adoption and integration
  • Artificial Intelligence
  • The evolution of software development and the speed with which the hard becomes easy
  • The deep integration of technology into every facet of our lives

I’m committing to one article a week exploring these topics and more. Follow along here or at @technicalchops on Twitter.