
The Anticipated Arrival of GPT-4: Insights and Expectations


Anticipating the release of GPT-4

Update: GPT-4 has been released.

The launch of GPT-4 is on the horizon.

GPT-3 was unveiled in May 2020, nearly two years ago, following a one-year gap after GPT-2's release, which itself came a year after the initial GPT paper. If this pattern continues, GPT-4 should already be available. While it isn't yet, Sam Altman, CEO of OpenAI, indicated a few months back that GPT-4 is on its way, with expectations pointing towards a release in mid-2022, likely around July or August.

Although the excitement surrounding GPT-4 is palpable, specific details about its characteristics and capabilities remain scarce. Altman provided some insights during a Q&A session last year, urging discretion regarding the information shared, and while I've remained quiet for several months, it's time to speculate. One certainty he shared is that GPT-4 will not reach the 100 trillion parameters I previously suggested; such a model will take longer to materialize.

It's been a while since OpenAI disclosed any information about GPT-4. However, emerging trends in AI, especially in Natural Language Processing (NLP), may offer hints about its features. Based on Altman's insights and the success of these trends, we can make some reasonable predictions that extend beyond the outdated notion of simply increasing model size.

Here are my forecasts regarding GPT-4, informed by the information available from OpenAI and Sam Altman, alongside current trends and advancements in language AI. (I will clarify which points are conjectural and which are established.)

Model Size: GPT-4 Will Not Be Exceptionally Large

GPT-4 is not expected to be the largest language model available. Altman has suggested it will not significantly surpass GPT-3 in size. While it will still be considerably larger than earlier neural network iterations, its distinguishing factor won't be its size. It's likely to fall between GPT-3 and Gopher (175B-280B).

This decision is grounded in sound reasoning.

Last year, Nvidia and Microsoft introduced Megatron-Turing NLG, boasting 530 billion parameters and claiming the title of the largest dense neural network—three times larger than GPT-3—until Google’s PaLM surpassed it at 540 billion. Intriguingly, several smaller models developed subsequently have achieved superior performance.

The principle of "bigger is better" is being challenged.

The existence of these smaller, high-performing models indicates two significant trends.

First, companies have recognized that merely increasing model size is neither the only nor the most effective method to enhance performance. In 2020, Jared Kaplan and his team at OpenAI concluded that the most substantial performance gains occur when computational resources are primarily directed at increasing the number of parameters, adhering to a power-law relationship. This guideline has been adopted by major players in the language model sector, including Google, Nvidia, Microsoft, and DeepMind.
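For a sense of what that power law looks like, here is a minimal sketch in Python. The functional form is the one Kaplan et al. report for loss as a function of parameter count; the constants are approximately their fitted values and should be read as indicative rather than exact.

```python
# Illustrative sketch of the power law Kaplan et al. (2020) report for loss as a
# function of (non-embedding) parameter count: L(N) ~ (N_c / N)^alpha_N.
# The constants below are approximately the paper's fitted values; treat them as
# indicative, not exact.

ALPHA_N = 0.076    # scaling exponent for parameter count
N_C = 8.8e13       # critical parameter count

def loss_from_params(n_params: float) -> float:
    """Predicted cross-entropy loss for a model with n_params non-embedding parameters."""
    return (N_C / n_params) ** ALPHA_N

for n in (1.5e9, 175e9, 280e9, 530e9):    # roughly GPT-2, GPT-3, Gopher, MT-NLG scale
    print(f"{n:>8.3g} params -> predicted loss ~ {loss_from_params(n):.3f}")
```

The shape of the curve is the point: loss keeps falling as parameters grow, but each tenfold increase in size shaves off only a modest, roughly constant fraction of it.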

Despite its massive size, MT-NLG does not excel in performance metrics. It fails to lead in any specific benchmark category. Smaller models, like Gopher (280B) and Chinchilla (70B)—a mere fraction of MT-NLG’s size—outperform it across various tasks.

It’s becoming clear that model size alone does not guarantee enhanced language comprehension, leading to the second implication.

Companies are beginning to move away from the "larger is better" mentality. Increasing the number of parameters is just one of many factors that can boost performance. The associated drawbacks, such as environmental impact, computational expenses, and accessibility challenges, render this approach one of the least favorable, despite its simplicity in implementation. Organizations are likely to reconsider building enormous models when they can achieve similar or superior results with smaller architectures.

Altman noted that OpenAI is shifting its focus from merely enlarging models to maximizing the effectiveness of smaller ones. Researchers at OpenAI were early advocates of the scaling hypothesis but may now be exploring alternative methods for developing improved models.

In summary, GPT-4 will not be significantly larger than GPT-3 due to these considerations. OpenAI is likely to concentrate on other elements—such as data management, algorithm refinement, parameter optimization, and alignment strategies—that could yield substantial improvements more efficiently. The capabilities of a 100 trillion parameter model will have to wait.

Optimality: Maximizing GPT-4's Potential

Language models face a critical challenge in optimization. The high costs associated with training necessitate trade-offs between accuracy and expenditure, often resulting in models that are notably under-optimized.

GPT-3 underwent training only once, despite various errors that might have warranted a retraining process. OpenAI opted against this due to prohibitive costs, which hindered researchers from identifying the optimal hyperparameters for the model (like learning rate, batch size, and sequence length).

The hefty training expenses also limit analyses of model behaviors. When Kaplan's team concluded that model size was the most significant variable for performance enhancement, they did not consider the number of training tokens—the quantity of data fed to the models. Considering this would have demanded resources that were economically unfeasible.

Tech companies adopted Kaplan's findings because they were the best information available at the time. Ironically, the same economic constraints that limited that analysis led organizations like Google, Microsoft, and Facebook to pour millions into increasingly large models, generating significant pollution in the process.

Currently, companies like DeepMind and OpenAI are exploring new methods. Their focus is shifting from simply enlarging models to discovering optimal configurations.

Optimal Parameterization

Recently, Microsoft and OpenAI demonstrated that GPT-3 could have performed better had it been trained with optimal hyperparameters. They found that a 6.7-billion-parameter version of GPT-3, tuned with the right hyperparameters (a search that is impractical to run directly on larger models), matched the performance of the original 13-billion-parameter model. The gain from tuning alone was equivalent to doubling the parameter count.

They identified a new parameterization (μP) under which the best hyperparameters for a small model are also the best for larger models in the same family. This makes it possible to run the hyperparameter search on a small model at a fraction of the cost and transfer the results to the large model essentially for free.
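As a rough picture of how that transfer works in practice, here is a simplified sketch (not the exact μP prescription): run the hyperparameter sweep on a cheap, narrow proxy model, then reuse the winning settings on the wide model, rescaling the learning rate of matrix-like (hidden) layers inversely with width while leaving vector-like parameters alone. All widths and values below are hypothetical.

```python
# Simplified sketch of the muP-style transfer workflow, with hypothetical numbers:
# sweep hyperparameters on a narrow proxy model, then reuse the winning settings on a
# wider model, rescaling the learning rate of matrix-like (hidden) layers inversely
# with width. The real muP rules are more detailed than this.

BASE_WIDTH = 256    # hidden width of the cheap proxy model used for the sweep

def transfer_hparams(tuned: dict, target_width: int) -> dict:
    """Map hyperparameters tuned on the proxy model to a model of `target_width`."""
    scale = BASE_WIDTH / target_width
    return {
        "lr_hidden": tuned["lr_hidden"] * scale,  # hidden-layer LR shrinks with width
        "lr_vector": tuned["lr_vector"],          # biases, LayerNorm gains: unchanged
        "batch_size": tuned["batch_size"],        # carried over as-is in this sketch
    }

# Hypothetical winners of a sweep run on the 256-wide proxy
tuned_on_proxy = {"lr_hidden": 3e-3, "lr_vector": 1e-3, "batch_size": 512}

for width in (1024, 4096, 12288):    # 12288 is GPT-3's hidden width
    print(width, transfer_hparams(tuned_on_proxy, width))
```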

Optimal-Compute Models

A few weeks ago, DeepMind revisited Kaplan's conclusions and found that, contrary to earlier beliefs, the number of training tokens influences performance as much as model size does. They concluded that as more compute becomes available, it should be split evenly between scaling parameters and scaling data. They validated the hypothesis by training Chinchilla, a 70-billion-parameter model (four times smaller than Gopher, the previous state of the art), on roughly four times as much data as is typical for large language models since GPT-3 (1.4 trillion tokens versus the usual 300 billion).

The results were clear. Chinchilla "uniformly and significantly" outperformed Gopher, GPT-3, MT-NLG, and all other language models across numerous benchmarks, indicating that current models are both undertrained and oversized.

Given that GPT-4 is expected to be slightly larger than GPT-3, it will require around 5 trillion training tokens to be compute-optimal (following DeepMind’s findings)—significantly exceeding current datasets. The computational requirements to achieve minimal training loss would be approximately 10-20 times larger than those used for GPT-3 (using Gopher's compute budget as a reference).
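Here is a back-of-envelope version of that estimate, using two common approximations: training compute of roughly 6·N·D (N parameters, D tokens) and the Chinchilla rule of about 20 tokens per parameter. The candidate GPT-4 sizes are assumptions taken from the "between GPT-3 and Gopher" range discussed above.

```python
# Back-of-envelope version of the estimate above, using two common approximations:
#   training compute  C ~ 6 * N * D   (N = parameters, D = training tokens)
#   Chinchilla-optimal data  D ~ 20 tokens per parameter
# The candidate GPT-4 sizes are assumptions, not known values.

def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

gpt3_flops = train_flops(175e9, 300e9)    # GPT-3: 175B parameters, ~300B tokens

for n in (175e9, 200e9, 250e9):           # hypothetical GPT-4 parameter counts
    d = 20 * n                            # compute-optimal token count
    ratio = train_flops(n, d) / gpt3_flops
    print(f"{n/1e9:.0f}B params -> ~{d/1e12:.1f}T tokens, ~{ratio:.0f}x GPT-3's training compute")
```

Depending on the assumed size, this rough arithmetic lands in the same ballpark as the figures above: a few trillion tokens and well over an order of magnitude more compute than GPT-3.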

Altman may have alluded to this when he mentioned in the Q&A that GPT-4 will utilize considerably more computational resources than GPT-3.

OpenAI is likely to integrate these optimality insights into GPT-4, though the extent of this implementation remains uncertain due to budget constraints. What is clear is that they will prioritize optimizing variables beyond model size. Identifying the best hyperparameters and determining the compute-optimal model size and parameter count could lead to remarkable improvements across all evaluation benchmarks. If these approaches are combined into a single model, any prediction about its capabilities will fall short.

Altman has also suggested that people will be astounded by the capabilities of models that do not require scaling up.

Multimodality: GPT-4 Will Be Text-Focused

The future of deep learning lies in multimodal models. Human cognition is inherently multisensory, shaped by our multimodal environment. Limiting AI to one sensory input at a time hampers its ability to comprehend and navigate the world.

Nevertheless, constructing effective multimodal models poses considerable challenges compared to developing high-quality language-only or vision-only models. Integrating visual and textual data into a unified representation is a complex endeavor, and our understanding of how the human brain accomplishes this is still limited (not to mention that the deep learning community often neglects insights from cognitive science regarding brain function).

Altman mentioned during the Q&A that GPT-4 would not be multimodal (like DALL·E or MUM) but rather a text-only model. I speculate that OpenAI is aiming to push language models to their limit before advancing to the next generation of multimodal AI.

Sparsity: GPT-4 Will Be a Dense Model

Recently, sparse models, which use conditional computation so that different parts of the network process different inputs, have achieved notable success. These models can scale past the 1-trillion-parameter mark without a proportional increase in computing cost, decoupling model size from compute per input. However, the benefits of Mixture-of-Experts (MoE) approaches tend to diminish at very large model sizes.
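To make "conditional computation" concrete, here is a minimal top-1 Mixture-of-Experts routing sketch in plain NumPy. The dimensions and expert count are arbitrary illustrative choices; the point is only that the layer stores many experts' worth of parameters while each token runs through just one of them.

```python
import numpy as np

# Minimal sketch of Mixture-of-Experts routing with top-1 gating, to show how a sparse
# layer can store many experts' worth of parameters while each token only runs through
# one of them. Dimensions and expert count are arbitrary illustrative choices.

rng = np.random.default_rng(0)
d_model, n_experts = 64, 8

router_w = rng.normal(size=(d_model, n_experts))           # gating network
experts = rng.normal(size=(n_experts, d_model, d_model))   # one weight matrix per expert

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x: (n_tokens, d_model). Each token is processed by a single selected expert."""
    logits = x @ router_w                                   # (n_tokens, n_experts)
    chosen = logits.argmax(axis=-1)                         # top-1 expert per token
    out = np.empty_like(x)
    for e in range(n_experts):
        mask = chosen == e
        if mask.any():
            out[mask] = x[mask] @ experts[e]                # only this expert's weights run
    return out

tokens = rng.normal(size=(16, d_model))
print(moe_layer(tokens).shape)    # (16, 64): full capacity stored, ~1/8 of it used per token
```

A dense layer, by contrast, multiplies every token by all of its parameters.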

Given OpenAI’s historical focus on dense language models, it's reasonable to anticipate that GPT-4 will also be dense. Since Altman has indicated that GPT-4 will not significantly exceed GPT-3 in size, it can be inferred that sparsity is not currently an avenue OpenAI is pursuing.

Sparsity, much like multimodality, is likely to play a more prominent role in future iterations of neural networks, as our brains—AI's primary inspiration—heavily rely on sparse processing.

Alignment: GPT-4 Will Exhibit Improved Alignment

OpenAI has dedicated significant resources to addressing the AI alignment challenge: how to ensure that language models act in accordance with human intentions and values—whatever those may be. This issue is not only mathematically complex (i.e., how can we make AI accurately understand our desires?) but also philosophically challenging (i.e., there isn't a universal approach to align AI with human values, given the vast diversity and often conflicting nature of human beliefs).

Nonetheless, they made initial strides with InstructGPT, a refined version of GPT-3 that was trained using human feedback to better follow instructions (though it doesn’t yet account for whether those instructions are beneficial or harmful).
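For context on what "trained using human feedback" involves mechanically: the InstructGPT recipe first fits a reward model on human preference comparisons, then fine-tunes the language model against that reward with reinforcement learning. Below is a minimal sketch of the pairwise reward-model objective; the reward scores are made-up numbers, and the real pipeline involves far more machinery.

```python
import math

# Sketch of the pairwise preference objective used to train the reward model in the
# InstructGPT recipe: given reward scores for a human-preferred response and a rejected
# one, minimize -log(sigmoid(r_chosen - r_rejected)). The scores below are hypothetical.

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

print(pairwise_loss(r_chosen=1.8, r_rejected=0.3))   # small loss: ranking already correct
print(pairwise_loss(r_chosen=0.2, r_rejected=1.5))   # large loss: the wrong answer is preferred
```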

The primary breakthrough from InstructGPT is that, regardless of its performance on language benchmarks, it is perceived as a superior model by human evaluators (who represent a relatively homogeneous group—OpenAI employees and English speakers—so we should be cautious about drawing broad conclusions). This underscores the importance of moving beyond traditional benchmarks as the sole metric for assessing AI's capabilities. Human perceptions of the models may hold equal or greater significance.

Considering Altman's and OpenAI’s commitment to developing beneficial AGI, I believe GPT-4 will incorporate and expand upon the insights gained from InstructGPT.

They are likely to enhance their alignment strategies, as the previous methods were limited to OpenAI employees and English-speaking annotators. True alignment efforts should encompass diverse groups representing various backgrounds and characteristics, including gender, race, nationality, and religion. This presents a considerable challenge, and any progress toward this goal is commendable (though we should be cautious about labeling it alignment if it doesn't resonate with a broader population).

In Conclusion…

Model Size: GPT-4 will be larger than GPT-3 but not significantly so compared to the largest models currently available (MT-NLG at 530B and PaLM at 540B). Model size will not serve as a defining characteristic.

Optimality: GPT-4 will utilize more computational resources than GPT-3. It will implement novel insights regarding optimal parameterization (identifying the best hyperparameters) and scaling laws (the number of training tokens is as crucial as model size).

Multimodality: GPT-4 will focus solely on text (not multimodal). OpenAI aims to maximize the capabilities of language models before fully transitioning to multimodal systems like DALL·E, which they anticipate will eventually outperform unimodal models.

Sparsity: Following the patterns established by GPT-2 and GPT-3, GPT-4 will be a dense model (with all parameters activated for any given input). The significance of sparsity will likely increase in future developments.

Alignment: GPT-4 will demonstrate greater alignment with human values than GPT-3. It will incorporate lessons learned from InstructGPT, which was trained using human feedback. However, achieving true AI alignment remains a complex endeavor, and efforts should be evaluated carefully to avoid overhyping progress.

Subscribe to The Algorithmic Bridge. Connecting algorithms to people. A newsletter about the AI that influences your life.

You can also support my work on Medium directly and gain unlimited access by becoming a member using my referral link here! :)
