Welcome to Archived - the chronicles of a VC attempting to decode innovation in technologies, markets and business models. If you enjoy this post, you can subscribe or catch me at ruth@frontline.vc for a chat!
The term Foundation Model was coined by the Stanford Center for Research on Foundation Models (CRFM) to encapsulate the pivotal (and unfinished) role these AI models (eg. GPT-3, DALL-E) are anticipated to play across a broad range of applications in the coming years. If foundation models truly are comparable to major technological paradigm shifts like Mobile and Cloud, it's worth taking a step back to gain a clear understanding of how these things work under the hood.
Stanford defines foundation models as:
Models trained on broad data (generally using self-supervision at scale) that can be adapted (fine-tuned) to a wide range of downstream tasks
The key terms in that definition (models, broad data, self-supervision, and adaptation to downstream tasks) are the building blocks that underpin foundation models (FMs). We'll step through each of these terms to decode the complexity of this technology.
MODELS
'Models' in this context refers to Deep Neural Networks (DNNs) and more specifically, a type of DNN called a Transformer. First, I'm going to explain how a plain vanilla DNN works and then I'll tackle the specifics of the Transformer architecture.
"PLAIN VANILLA" FEED-FORWARD NEURAL NETWORK (FFNN)
Neural networks broadly mimic the structure of the brain - they have neurons, connectors, and like onions & ogres (iykyk), they have layers! The best way to understand neural nets is through an example, so let's take the canonical one of a network that learns to recognize hand-written digits. Shout out to 3blue1brown for doing most of the heavy lifting here!
First up, let's look at the structure of the network (a small code sketch of this structure follows the breakdown below):
Input Image: The network is trained on a bunch of hand-written digits. The MNIST database provides a collection of 60k+ examples of digits from 0 to 9, making it the ideal training data for this use case. Each image is made up of 784 pixels (28x28) and each of these pixels becomes a neuron in the input layer of the network.
The Input Layer: Think of a neuron as "a thing that holds a number" between 0 and 1. The closer the number is to 1, the more "active" the neuron. You might ask, well, how do these numbers get assigned? In this example, the number represents the greyscale value of the pixel, so a black pixel = 0, a white pixel = 1 and so on.
The Hidden Layers: If each neuron in layer 1 represents a pixel, think of each neuron in layer 2 as representing the edge of a number, and each neuron in layer 3 as representing a line or loop of a number. Activations in one layer determine activations in the next through the dark arts of weights (aka parameters) and biases. Each connector (ie. the grey lines) in the network is assigned a weight - think of the weight as the strength of the connection. Each neuron takes a weighted sum of the activations feeding into it, adds a bias, and squashes the result back into the 0 to 1 range.
The Output Layer: The final layer of the network consists of 10 neurons, each representing a single digit. The most active neuron in this last layer is basically the network's choice of what the input image actually is (in this case, a 3).
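To make this concrete, here's a minimal NumPy sketch of the structure just described. The two hidden layers of 16 neurons and the sigmoid squashing function are illustrative assumptions (borrowed from the 3blue1brown walkthrough), not a prescription:

```python
import numpy as np

def sigmoid(z):
    # Squash any number into the 0-1 range, matching "a neuron holds a number between 0 and 1"
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)

# Layer sizes: 784 input pixels -> two hidden layers -> 10 output digits
sizes = [784, 16, 16, 10]

# One weight matrix per connection between layers, plus a bias per neuron, randomly initialized
weights = [rng.standard_normal((n_out, n_in)) * 0.01
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

def forward(pixels):
    """Feed a flattened 28x28 image (784 greyscale values in [0, 1]) through the network."""
    activation = pixels
    for W, b in zip(weights, biases):
        activation = sigmoid(W @ activation + b)
    return activation  # 10 numbers; the index of the largest is the predicted digit

image = rng.random(784)          # stand-in for a real MNIST image
print(forward(image).argmax())   # the network's (currently random) guess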
Next up, the 'learning'
Learning = Finding the right weights & biases that minimize the cost function
So how does the network actually 'learn'? At the outset of training, weights & biases are just randomly assigned numbers. As you can imagine, the network performs pretty poorly as a result. This is where the cost function comes into play. The cost function measures the difference between the network's prediction and the choice we expect (ie. a 3). The 'cost' measures how badly the network is performing, so naturally we want to minimize this cost to make the network useful. Minimizing the cost is done using Gradient Descent - for now, just know that this is an algorithm that helps you find a 'local minimum' (we'll tackle this another time!).
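To give a flavour of what gradient descent is doing, here's a toy sketch that minimizes a made-up, one-parameter cost function. The real network repeats this kind of downhill nudge for every weight and bias at once (via backpropagation); the cost function and learning rate below are purely illustrative assumptions:

```python
# Toy gradient descent: find the value of w that minimizes a simple cost function.
# This stands in for what happens to every weight and bias in the real network.

def cost(w):
    return (w - 3) ** 2          # minimal at w = 3

def cost_gradient(w):
    return 2 * (w - 3)           # derivative of the cost with respect to w

w = 10.0                         # start from a "randomly assigned" value
learning_rate = 0.1

for step in range(100):
    w -= learning_rate * cost_gradient(w)   # nudge w downhill

print(round(w, 4))               # ~3.0, the value that minimizes the cost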
***Side Note: I find the concept of deep learning easier to comprehend through the lens of computer vision, which is why I used the image recognition example above. Foundation models, however, have primarily taken shape thus far in NLP so I'll focus the rest of the post on language use cases.***
EVOLUTION OF NLP
Although FFNNs perform well on pattern recognition problems (such as spam detection or number recognition), their lack of 'memory' about the inputs they receive means they perform poorly on any type of sequential data (eg. time series, speech, text).
Enter Recurrent Neural Networks (RNNs)!
The most important thing to remember about RNNs is that they process data sequentially - words are fed in one at a time and the network holds information about prior words in memory.
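Here's a minimal sketch of that recurrence. The toy dimensions, the tanh squashing function and the random weight matrices are illustrative assumptions (in a real RNN the weights are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 8, 16

# One weight matrix for the incoming word, one for the memory carried over from the previous step
W_in = rng.standard_normal((hidden_dim, embed_dim)) * 0.1
W_hidden = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1

def rnn_step(word_vector, memory):
    # The new memory mixes the current word with whatever the network remembers so far
    return np.tanh(W_in @ word_vector + W_hidden @ memory)

words = "what time is it ?".split()
sentence = [rng.standard_normal(embed_dim) for _ in words]  # stand-in embeddings, one per word

memory = np.zeros(hidden_dim)
for word_vector in sentence:       # words must be processed one at a time - no parallelism here
    memory = rnn_step(word_vector, memory)

print(memory.shape)                # a single vector that (imperfectly) summarizes the whole sentence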
Although this structure enabled progress in NLP, its impact was limited due to two key constraints:
Short-Term Memory: Similar to Dory from Finding Nemo, RNNs have short-term memory - they tend to forget the beginning of a sentence by the time they reach the end. Take the sentence 'What time is it?': the RNN has essentially forgotten the words 'what' and 'time' as it approaches the end of the sentence. It must try to guess the next word based only on the context 'is it' - a challenging task!
Parallelization Issues: RNNs' sequential architecture requires the network to compute the memory of the word 'what' before it can encode the word 'time'. The challenge here is one of hardware - RNNs don't parallelize well (ie. you can't run them on a million servers at once so they're slow to train).
TRANSFORMERS
Transformers entered the scene in 2017. This new type of neural network architecture, proposed in Google's "Attention Is All You Need" paper, quite literally changed the game. Transformers brought two key innovations over their predecessor (RNNs):
Positional Encodings: Transformers feed all input words into the network at once. This solves the parallelization problem but raises the question of how the network will remember word order. This is done via positional encodings. Put simply, before feeding a bunch of text into the network, you stamp a positional signal onto each word embedding - in this way, word order is stored in the data itself.
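As a concrete (if simplified) example, the original "Attention Is All You Need" paper adds sinusoidal positional encodings to the word embeddings. A minimal NumPy sketch, with toy dimensions chosen purely for illustration:

```python
import numpy as np

def positional_encoding(seq_len, dim):
    """Sinusoidal positional encodings in the style of 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]                    # 0, 1, 2, ... one row per word
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    encoding = np.zeros((seq_len, dim))
    encoding[:, 0::2] = np.sin(positions * freqs)              # even dimensions
    encoding[:, 1::2] = np.cos(positions * freqs)              # odd dimensions
    return encoding

word_embeddings = np.random.default_rng(0).standard_normal((6, 8))  # 6 words, 8-dim embeddings
inputs = word_embeddings + positional_encoding(6, 8)                # word order is now baked into the data
print(inputs.shape)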
Self-Attention: Transformers have the ability to pay attention to the most important parts of an input, in the same way we as humans do! For example, when you read a sentence like 'they sat on the bank of the river', you understand that we're not referring to a bank containing $$, but rather a river bank. Self-attention enables machines to understand the correlation between similar words and come to the same conclusion.
But again, how does this learning actually occur? Let's walk through it using the river bank sentence as an example:
Step 1 - Create Word Embeddings: Neural Networks don't understand text, but they do understand numbers. Word embeddings are a means of converting words into vectors (ie. an array of numbers) such that similar words are close to each other in the embedding space. Words are similar if they tend to be used in similar contexts (eg. king and queen are similar as they are both used around other words such as crown, royalty etc).
Step 2 - Add Positional Context: As we explained above, we then need to slap a number on each vector so that the system can remember the word's position in the sentence.
Step 3 - Extract Features with High Attention: In keeping with the culture of the ML ecosystem, there are three cryptic terms used to describe the basis of how the system learns what to pay attention to - query, key and value. The best analogy for this trio is to think of a retrieval system like YouTube. When you search something on YouTube, the engine maps your input in the search bar (ie. the query) against a set of dimensions such as video titles (ie. keys) and presents you with the best matched videos (ie. values). Applying this to our example, we're trying to add more context to the word embedding for 'bank'. To do this, we map 'bank' (ie. the query) against all the other words in the sentence (ie. the keys) using some math that we won't get into [dot product, scaling, normalizing]. The resulting outcome is a heat map, where 'similar words' have higher weightings/attention.
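For the curious, here's what that "dot product, scaling, normalizing" math looks like as a minimal scaled dot-product self-attention sketch in NumPy. In a trained Transformer the query/key/value projection matrices are learned; the random matrices and toy dimensions here are placeholder assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, dim = 6, 8                            # e.g. the six words of the river-bank sentence

inputs = rng.standard_normal((seq_len, dim))   # word embeddings + positional encodings

# In a trained Transformer these projections are learned; here they're random placeholders
W_q, W_k, W_v = (rng.standard_normal((dim, dim)) for _ in range(3))

Q, K, V = inputs @ W_q, inputs @ W_k, inputs @ W_v

# Dot product + scaling + normalizing: one row of attention weights per word (the "heat map")
attention_weights = softmax(Q @ K.T / np.sqrt(dim))

contextual_embeddings = attention_weights @ V  # each word is now a weighted blend of the others
print(attention_weights.round(2))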
Ok, so now that we've established the architecture of a foundation model, let's move on to how these things are trained: via broad data and self-supervision.
SELF-SUPERVISION ON BROAD DATA
FMs are fed, and therefore learn from, a little thing we call the world wide web - this includes everything from Wikipedia and online books to Reddit (the 'CC' you'll often see in training data breakdowns stands for Common Crawl).
FMs are generally trained in a self-supervised way, meaning that labels are generated from the input data itself, not by human annotators. For example, "I do not like green ____ and ham" becomes the input (x) to the network and 'eggs', the word that was held out, becomes the label (y) the network learns to predict. This contrasts with supervised learning, where each word is manually labelled by a human based on category (eg. green = color, ham = food) or grammatical form (eg. noun, adjective).
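A tiny sketch of how training pairs fall out of raw text with no human in the loop, using the sentence above (framed here as next-word prediction, as in the GPT family; masked prediction à la BERT works similarly):

```python
# Self-supervision: the labels come from the text itself, not from human annotators.
sentence = "I do not like green eggs and ham".split()

# Slide over the sentence and, at each step, hide the next word and use it as the label.
training_pairs = [(sentence[:i], sentence[i]) for i in range(1, len(sentence))]

for x, y in training_pairs:
    print(f"input: {' '.join(x):<30} label: {y}")
# e.g. input: "I do not like green"  ->  label: "eggs"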
Supervised learning has dominated over the past decade, with entire industries popping up to help with the gargantuan task of labelling the world's data (Scale AI, Snorkel). However, labelling data is a costly affair and challenging to scale. Supervised models are also specialists. They get an A+ for the specific job they were trained to do but often fail on inputs that are even marginally out of context.
Let's think about how children learn for a moment. Don't you think it's strange that my 2 year old nephew can recognize a cow in pretty much any scenario having only seen a few farm animal books, yet a neural network using supervised learning needs to see thousands of examples of cows and may still struggle to identify one in an unusual setting?
This is because humans rely on background knowledge they've built up about the world through observation - a.k.a. common sense. Self-supervised learning is an attempt at replicating this phenomenon in machines. The emergence of BERT and GPT-2 in 2018/19 brought with it a resurgence of interest in self-supervised learning. The impact of this was twofold:
Scale: Models could be trained on huge unlabeled datasets, thus beginning the journey of 'learning the internet'.
Generalizability: Because data labels aren't explicitly provided, the model learns subtle patterns behind the data and can be more easily applied to similar downstream tasks.
Now we know both the architecture of an FM and how these things are trained. The last piece of the puzzle is understanding how these models are adapted to downstream tasks.
TRANSFER LEARNING
Transfer learning does what it says on the tin - it involves taking the 'knowledge' a model has learned from one task (eg. classifying cars) and transferring it to another (eg. classifying trucks). The traditional method of transfer learning is fine-tuning.
Fine-tuning (sketched in code after the list below) makes the most sense to employ when:
You have a lack of data for your target task
Low-level features from the pre-trained model are helpful for learning your target task
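As a rough illustration, here's a minimal PyTorch-style sketch of one common flavour of fine-tuning: freeze the pre-trained layers and train only a small new head for the target task (full fine-tuning would instead update every weight). The toy 'pre-trained' body, the layer sizes and the 5-class truck task are stand-in assumptions, not a real checkpoint:

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained model (e.g. one that learned to classify cars).
# In practice you'd load real pre-trained weights instead of this toy body.
pretrained_body = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
)

# Freeze the body: its low-level features are reused, not relearned.
for param in pretrained_body.parameters():
    param.requires_grad = False

# New head for the target task (e.g. classifying trucks into 5 categories).
new_head = nn.Linear(128, 5)
model = nn.Sequential(pretrained_body, new_head)

# Only the head's parameters are updated during fine-tuning.
optimizer = torch.optim.Adam(new_head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 784)            # a small batch of target-task examples
y = torch.randint(0, 5, (32,))      # their labels

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()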
Ok, so we've now unpacked the technological building blocks of FMs:
1) Transfer Learning, 2) Self-Supervised Learning and 3) the Transformer Architecture, all of which enabled these powerful AI models to come to fruition.
Interestingly, none of these methods are new. In fact, they've been around for decades. So what was the true innovation of GPT-3 that kicked off the era of FMs?
The secret sauce was actually scale and the infrastructure that enabled this (an ode to MLOps!)
The release of GPT-3 brought with it a new phenomenon - the emergence of in-context learning. The term emergence means 'more is different' - that is, quantitative changes (ie. 'more') can lead to unexpected phenomena (ie. 'different'). In the case of Machine Learning, the sheer scale of GPT-3 at 175bn parameters ('more') led to the emergence of in-context learning ('different').
In-context learning removes the need for a large labelled target dataset as well as the process of fine-tuning (updating the weights), which saves both time + $$. The model can be adapted to downstream tasks by providing it with a description of the task, a prompt and some examples ('shots') in the case of one or few-shot learning.
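As a sketch of what a few-shot prompt actually looks like, here's the German-to-English example from later in the post written out in Python. The extra word pairs and the final query are made up for illustration, and the call to an actual hosted model is left out as a placeholder:

```python
# Few-shot "in-context learning": no weight updates, just a carefully constructed prompt.
# The task description, examples ("shots") and final query all travel in the input text.

task_description = "Translate German to English."

shots = [
    ("rot", "red"),     # from the post
    ("Hund", "dog"),    # illustrative extra shot
]

query = "Katze"         # illustrative final query

prompt = task_description + "\n"
for german, english in shots:
    prompt += f"{german} = {english}\n"
prompt += f"{query} ="

print(prompt)
# Translate German to English.
# rot = red
# Hund = dog
# Katze =
#
# Send this prompt to the model of your choice; a capable FM will typically complete it with "cat".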
It's worth taking a moment to really internalize what's happening here:
the model can complete a task that it's never been explicitly trained on
At first blush, this seems pretty magical - and to some extent, it is. Unpacking how this occurs, however, helps us shed some light on this mysterious concept. What you're basically hoping for is that somewhere on the internet (and thus in your training data), the model has seen the word structure 'translate German to English' and 'rot = red'. A quick Google search shows that the model could have easily picked this structure up from multiple different sites whilst doing its common crawl of the web.
This highlights a seemingly obvious but nevertheless important fact:
In order for a model to generalize well to a specific downstream task, data relating to that task must be included in the training dataset!
So if scale was the fundamental unlock for foundation models, why was there a lag between GPT-3's release (June 2020) and the now ubiquitous 'AI boom'? The answer lies in accessibility. In recent years, there has been a steady democratization of access to state-of-the-art models, enabled by a progression in the model interface. Models that were historically reserved for academia can now be accessed by the general public via an approachable user interface.
** Technical explanation = fin! If you've made it this far, I salute you 🫡 and hope that this post helped demystify some of the technical jargon associated with FMs!
ARE FOUNDATION MODELS REALLY "ALL YOU NEED"?
Foundation Models have undoubtedly opened up the floodgates for ML adoption and their existence raises the question as to whether any company would want to build their own models in the future. This is an area I've spent some time pressure testing.
There are both offensive and defensive factors that I believe will continue to drive some Enterprises towards a 'BYOM' (build your own model) approach.
Proprietary Data: FMs have built up an immense knowledge of the internet - that is, all publicly available datasets. What they haven't been trained on, however, is the billions of proprietary Enterprise datasets that exist behind closed doors. What about fine-tuning, you might ask? As highlighted above, fine-tuning makes most sense when you have a lack of data for your domain-specific task and when the statistics of the pre-trained model are similar to your target task. For many companies, however, a lack of data is certainly not the problem, and often FMs won't generalize well to domain-specific use cases.
Hallucinations: LLMs do a fantastic job of impressing (and fooling) the human 'right brain' (our creative/intuitive side) by generating content that is extremely plausible at first glance. However, we shouldn't confuse syntax proficiency with content rooted in fact. LLMs, in short, use probabilistic next-word prediction based on an ingested corpus of the internet. Although these models have an exceptionally large memory, their ability to generate untruthful content means that in their current state, these systems shouldn't be deployed (off-the-shelf) in areas where there is a definitive right answer, and in particular where the risk associated with giving the wrong answer is high - take Sam Altman's word for it!
Data Provenance & Privacy: The history of Machine Learning is a story of both emergence and homogenization (ie. the consolidation of methods). Deep Learning brought about the homogenization of model architectures (Neural Nets, Transformers etc) and now FMs are introducing homogenization of the model itself. FMs are thus a double-edged sword - on one hand, improvements lead to widespread immediate benefits; on the other hand, defects are inherited by all downstream models, thus amplifying intrinsic biases. This issue presents a significant barrier to adoption for companies in regulated industries such as Banking, Healthcare and Legal, where data provenance is critical.
Squeezing Margins & Latency Issues: Where FMs are implemented in customer-facing use cases, the cost of running inference through FM providers like OpenAI becomes prohibitively expensive at scale, squeezing the margins of the businesses building on top of them. In a similar vein, bigger is not always better with respect to model size where latency is concerned. For example, increasing the size of a code completion model may improve performance, but if it's too slow to provide code suggestions as the user is typing, the product's value is questionable.
For these reasons, I think it's clear that Foundation Models are not the only thing we need! They represent a critical piece of the puzzle but not the whole picture. I expect a hybrid world to emerge, where we'll see a mix of companies building their own models and leveraging FMs. As a VC, I'm excited about companies that are lowering the barriers to entry on both sides (shout out to MosaicML, a Frontline portfolio company, that's leading the way on the BYOM side!). This layer of 'picks and shovels' within the AI stack will unlock exponential progress and an abundance of intelligent applications in the coming years - a future I'm undeniably excited about!