
From Dice to Chatbots: The Evolution of Large Language Models

phoue

9 min read

Would you believe that ChatGPT, a symbol of cutting-edge technology, originated from a casino in Monaco? Let’s follow the grand evolution of large language models (LLMs).

  • The probabilistic idea that sparked artificial intelligence: the principle of the Monte Carlo method
  • The AI winter and how multilayer perceptrons overcame it, leading to the big bang of deep learning
  • The transformer architecture innovation that made modern large language models possible
  • The future beyond ChatGPT: agentic AI and the current state of LLMs in Korea

Dawn of Artificial Intelligence: The Emergence of Probability and Neural Networks

In the winter of 2022, the world was captivated by the seemingly magical appearance of the AI chatbot ChatGPT. This astonishing technology, which seemed to arrive overnight, answered our questions fluently, wrote poetry, and even wrote code, stunning the entire globe. It felt as if the future from a sci-fi movie had suddenly become reality.

But was this truly a miracle achieved overnight? Or was it the fruit of decades of persistent effort we hadn’t fully recognized? To answer this, we must go back in time. Surprisingly, the journey began not in a cutting-edge computer lab but with an idea inspired by chance and probability at a casino in Monaco.

1. Casino Secrets: Taming Uncertainty with the Monte Carlo Algorithm

Why bring up a casino when talking about AI history? At the beginning lies a uniquely named methodology called the ‘Monte Carlo algorithm.’ The name comes from Monte Carlo, Monaco’s famous gambling district, and its principle is deeply related to the probability games played there.

The core of the Monte Carlo method is ‘making many random attempts.’ When a problem is too complex or outright impossible to calculate exactly, this technique obtains an approximate solution through countless random trials.

By randomly plotting points inside a square and measuring the fraction that land inside the inscribed circle, one can approximate the value of pi (π): that fraction approaches π/4, so multiplying it by 4 yields π. This is the basic principle of the Monte Carlo method.

Figure: Pi calculation using the Monte Carlo method.
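Here is a minimal sketch of that idea in Python (it samples the unit square and the quarter circle of radius 1, which gives the same π/4 ratio as the inscribed-circle picture):

```python
import random

def estimate_pi(num_samples: int = 1_000_000) -> float:
    """Approximate pi by sampling random points in the unit square."""
    inside = 0
    for _ in range(num_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:   # point falls inside the quarter circle
            inside += 1
    return 4 * inside / num_samples   # inside/total approximates pi/4

print(estimate_pi())  # e.g. 3.1424 (the exact value varies from run to run)
```

The more samples you throw, the closer the estimate gets, which is exactly the casino-style spirit of brute-force probability.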

This idea was also applied to games like chess and Go, where the number of possibilities is practically infinite: instead of calculating every possibility, promising paths are explored at random to estimate the best move. AlphaGo, which later defeated Go champion Lee Sedol, used exactly this idea as its core weapon, ‘Monte Carlo Tree Search (MCTS).’ This probabilistic approach of finding the most plausible answer aligns with the fundamental philosophy of large language models, which predict the next word probabilistically.

2. The Birth of AI and Two Diverging Paths

In 1956, the Dartmouth Workshop introduced the term ‘Artificial Intelligence (AI),’ and AI research split into two major directions:

  • Symbolism: A top-down approach viewing human intelligence as the result of logical rules and symbol manipulation, aiming to program these explicitly.
  • Connectionism: A bottom-up approach inspired by brain structures, believing intelligence emerges from connecting numerous artificial neurons.

In 1958, psychologist Frank Rosenblatt developed the ‘Perceptron,’ the first practical artificial neural network model mimicking brain neurons. It took multiple inputs, multiplied them by weights, and activated if the sum exceeded a threshold. The innovation was that these weights could be ‘learned’ from data.
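As a rough sketch of that decision rule (with hand-picked weights rather than learned ones, purely for illustration):

```python
def perceptron(inputs, weights, bias):
    """Fire (output 1) if the weighted sum of the inputs crosses zero."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum > 0 else 0

# With these hand-picked weights, a single perceptron computes AND:
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", perceptron([a, b], weights=[1.0, 1.0], bias=-1.5))
```

Rosenblatt's contribution was an update rule that nudges the weights after each mistake, so the machine can find such values on its own instead of having them hand-picked.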

3. The First AI Winter: The XOR Problem That Frustrated AI

The perceptron solved ‘linearly separable’ problems like AND or OR, where a single line can separate correct and incorrect answers, fueling optimism in AI research.


However, this optimism collapsed in the face of the deceptively simple XOR (exclusive OR) problem. XOR is true only when its two inputs differ, and no single straight line can separate its true cases from its false ones.

Figure: AND (left) and OR (center) can each be separated by a single line, but XOR (right) cannot. This simple problem frustrated early AI.

In 1969, Marvin Minsky and Seymour Papert mathematically proved this limitation in their book Perceptrons, turning the high expectations for AI into disappointment and triggering the ‘AI winter,’ a period of sharply reduced investment. Personally, I think the shock of the XOR problem was like having all the ingredients and the recipe but missing one crucial spice that ruins the dish. This deceptively simple problem dampened hopes for AI and led to a long stagnation, a reminder that innovation can be blocked by small details rather than by massive obstacles.

The Leap of Deep Learning and the Dawn of Large Language Models

After the harsh winter, AI evolved into deeper and more complex structures, seizing another chance to leap forward.

4. The Savior Ending the Dark Age: Multilayer Perceptrons and Backpropagation

The end of the first AI winter came with the idea ‘If one line doesn’t work, use many lines,’ realized in the ‘Multilayer Perceptron (MLP).’ MLP added one or more ‘hidden layers’ between input and output layers. These hidden layers nonlinearly transformed data, enabling solutions to problems like XOR that single-layer perceptrons couldn’t solve.

Figure: Multilayer Perceptron (MLP) structure. By adding hidden layers between the input and output layers, the MLP solved nonlinear problems that single-layer perceptrons could not.

However, training such complex networks was challenging. The breakthrough came with the ‘Backpropagation’ algorithm. Backpropagation calculates how much each connection (weight) contributed to the error by propagating the error backward from the output, enabling effective training of deep neural networks.
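To make the two ideas concrete, here is a toy sketch in plain NumPy (layer sizes, learning rate, and iteration count are arbitrary choices) of a small MLP trained with backpropagation on the very XOR problem that stopped the single-layer perceptron:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR dataset: four input pairs and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer with 4 units between the input and output layers.
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros((1, 1))

lr = 1.0
for _ in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)       # hidden-layer activations (the nonlinear transform)
    out = sigmoid(h @ W2 + b2)     # network prediction

    # Backward pass: propagate the error from the output back toward the input.
    d_out = (out - y) * out * (1 - out)      # error signal at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)       # error signal at the hidden layer

    # Update every weight in proportion to its contribution to the error.
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(np.round(out, 2))  # approaches [[0], [1], [1], [0]], the XOR table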

5. 2012, The Big Bang of Deep Learning: The Arrival of AlexNet

Although the theoretical tools had existed since the 1980s, deep learning’s potential only exploded with the advent of ‘big data’ and ‘GPUs.’ The 14-million-image dataset ‘ImageNet,’ released in 2009, combined with the powerful parallel computing of GPUs to complete the trinity of algorithms, data, and compute.

Figure: ImageNet challenge error rate trend. AlexNet’s 2012 debut dramatically lowered image recognition error rates, marking the dawn of the deep learning era; by 2015, models had even surpassed the human error rate of roughly 5%.

In 2012, Geoffrey Hinton’s team won the ImageNet challenge with the deep convolutional neural network (CNN) ‘AlexNet,’ achieving a remarkable top-5 error rate of 15.3% and signaling the start of the deep learning era. This event spread the belief that ‘scale equals performance,’ foreshadowing the emergence of large language models.


6. Understanding the Flow of Time: Recurrent Neural Networks (RNN)

After conquering images, AI’s next target was sequential data like language, where ‘order’ matters. The model developed for this was the ‘Recurrent Neural Network (RNN).’ RNNs create ‘loops’ inside the network to remember previous steps and incorporate that information into current calculations.

Figure: RNN operation principle. Through its recurrent structure, an RNN reflects information from the previous step (the hidden state) in the current calculation, allowing it to process sequential information such as context.

However, RNNs suffered from the critical ‘long-term dependency problem,’ forgetting earlier information as sequences grew longer. Although improved models like LSTM and GRU appeared, the fundamental limitation of sequential processing remained.
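A bare-bones sketch of that recurrence (one NumPy function standing in for the loop, with made-up dimensions and random weights):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrence step: blend the current input with the previous hidden state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
W_xh = rng.normal(0, 0.1, (input_dim, hidden_dim))
W_hh = rng.normal(0, 0.1, (hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

# Process a toy "sentence" of five word vectors strictly one step at a time.
sequence = rng.normal(0, 1, (5, input_dim))
h = np.zeros(hidden_dim)
for x_t in sequence:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)   # h carries the context forward

print(h.shape)  # (16,): a compressed summary of everything seen so far
```

Because each step depends on the previous one, the work cannot be parallelized, and the influence of early inputs fades as it is squeezed through tanh again and again; that fading is the long-term dependency problem in miniature.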


The Modern Deity: Transformers and Large Language Models

In 2017, a single paper changed everything, opening the true era of large language models.

7. “Attention Is All You Need”: The Transformer That Changed the World

Google’s 2017 paper introduced the revolutionary ‘Transformer’ architecture, completely abandoning the ‘recurrent’ structure that was the foundation of sequential data processing.

Transformers use a ‘self-attention’ mechanism to lay out all words at once and simultaneously calculate the importance of relationships each word has with every other word in the sentence.

Figure: Self-attention mechanism example. Self-attention calculates the strength of the relationships between the word “it” and the other words in the sentence, identifying “animal” as the most related. (Image source: Jay Alammar)

This approach fundamentally solved the long-term dependency problem and allowed all computations to be processed in parallel, maximizing GPU performance. This overwhelming efficiency opened the era of ‘large’ language models on a previously unimaginable scale.
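A minimal sketch of single-head scaled dot-product self-attention (NumPy, random toy values; real transformers add multiple heads, masking, and positional information):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over the whole sequence at once."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # relevance of every word to every other word
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # each row sums to 1
    return weights @ V                                  # context-mixed representation of each word

rng = np.random.default_rng(0)
seq_len, d_model = 6, 32
X = rng.normal(0, 1, (seq_len, d_model))               # six token embeddings
W_q, W_k, W_v = (rng.normal(0, 0.1, (d_model, d_model)) for _ in range(3))

print(self_attention(X, W_q, W_k, W_v).shape)          # (6, 32): all six tokens updated in parallel
```

Note that nothing in this computation is sequential: the score matrix for the whole sentence is produced in one matrix multiplication, which is exactly what GPUs excel at.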

8. The Age of Giants: BERT and GPT

Based on transformers, two giant models dominating natural language processing emerged: BERT and GPT. To simplify, BERT is the detective, GPT is the storyteller.

  • BERT (Context Detective): Trained using the ‘masked language model’ method, filling in blanks by considering both left and right context. It excels at understanding subtle word meanings and is a core technology in Google Search.
  • GPT (Creative Storyteller): Trained to predict the most probable next word given previous words. This autoregressive method empowers creative text generation. GPT-3 notably demonstrated ‘few-shot learning,’ performing new tasks with just a few examples, opening possibilities for general AI.
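The contrast between the two training objectives just described is easy to feel with the Hugging Face transformers library. A rough sketch, assuming the library and the two small public models are available (they are downloaded on first run, and the exact outputs will vary):

```python
from transformers import pipeline

# BERT-style: fill in a masked blank using context from both sides.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The capital of France is [MASK].")[0]["token_str"])   # most likely: "paris"

# GPT-style: keep predicting the most probable next word (autoregressive generation).
generate = pipeline("text-generation", model="gpt2")
print(generate("Once upon a time", max_new_tokens=20)[0]["generated_text"])
```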

9. The Birth of Human-like AI: ChatGPT’s Secret, RLHF

GPT-3 was impressive, but it sometimes confidently made things up or generated harmful content. Aligning the model with human intentions and values became necessary.


Figure: The three-step RLHF training process. Reinforcement Learning from Human Feedback (RLHF) trains language models to align with human intentions through a three-step process. (Image source: Hugging Face)

The key technology that solved this and gave birth to ChatGPT is ‘Reinforcement Learning from Human Feedback (RLHF),’ which proceeds in three steps:

  1. Step 1 (Instruction Tuning): Teach the model to follow user instructions using an ‘instruction-answer’ dataset.
  2. Step 2 (Reward Model Training): Humans rank multiple answers, and the model learns to score answers, creating a ‘judge AI.’
  3. Step 3 (Reinforcement Learning): The step 1 model generates answers, the step 2 judge scores them, and the model adjusts itself to achieve higher scores.
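Step 2 is the heart of the pipeline. A minimal sketch of the pairwise ranking loss commonly used to train such a ‘judge’ (the scores below are made up):

```python
import math

def reward_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    """Push the judge's score for the human-preferred answer above the other answer's score."""
    # -log(sigmoid(r_chosen - r_rejected)): small when the judge already agrees with the human ranking.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

print(reward_ranking_loss(2.0, -1.0))  # judge agrees with the humans -> low loss (~0.05)
print(reward_ranking_loss(-1.0, 2.0))  # judge disagrees -> high loss (~3.05)
```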

Recently, more efficient techniques like ‘Direct Preference Optimization (DPO)’ that simplify this process have gained attention.
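DPO skips the separate judge and the reinforcement-learning loop and optimizes the human preference directly. A sketch of its per-pair loss, using toy log-probabilities (beta controls how strongly the preference is enforced):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: reward the policy for preferring the chosen
    answer more strongly than the frozen reference model does."""
    chosen_margin = logp_chosen - ref_logp_chosen        # how much more the policy likes the chosen answer
    rejected_margin = logp_rejected - ref_logp_rejected  # ... and the rejected one
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))    # -log sigmoid(logits)

# The policy already leans toward the chosen answer, so the loss is below log 2 (~0.69),
# the value it would have if the policy were indifferent.
print(dpo_loss(logp_chosen=-5.0, logp_rejected=-9.0,
               ref_logp_chosen=-6.0, ref_logp_rejected=-6.0))   # ~0.51
```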


Status and Future of Large Language Models in South Korea

Amid the LLM race triggered by ChatGPT, Korean companies are competing fiercely to secure ‘sovereign AI’ that deeply understands the Korean language and culture.

Comparison of Leading Korean LLMs

Developer | Model Name | Key Features
Naver | HyperCLOVA X | Built on Naver’s massive data, Korean language specialization, ‘Thinking’ feature, integration with Naver’s own services (search, shopping, etc.)
Kakao | Koala (formerly KoGPT) | Open source (commercial use allowed), lightweight and efficient, strong Korean performance, multimodal support
SKT | A.X (A.Dot.X) | Developed ‘from scratch,’ multimodal (VLM), high-performance document encoder, telecom specialization
LG AI Research | EXAONE | Expert AI, hybrid of reasoning and generation, specialized in math/coding/science
Upstage | SOLAR | Lightweight model (SLM) with top-tier performance, high efficiency and cost-effectiveness, #1 on global leaderboards

In this fierce competition, the ‘Open Ko-LLM Leaderboard’ led by Upstage serves as a standard benchmark objectively comparing domestic models’ performance, contributing to the development of Korea’s AI ecosystem.


Conclusion: The Journey Toward Agent AI and Our Challenges

In roughly 70 years, the AI journey that began with throwing dice has arrived at large language models that converse like humans. Now the technology is advancing to the next stage: ‘agentic AI.’ An agentic AI autonomously sets goals, makes plans, and uses tools to perform complex tasks as an active problem solver.

Facing this dazzling future, Geoffrey Hinton warns of superintelligence risks, and Yann LeCun points out the current LLMs’ lack of ‘common sense,’ advocating for new architectures. Their debates show we stand not at the peak of technology but at a new starting point.

Key Summary

  1. A journey starting from probability: AI began with a probabilistic approach inspired by casino probability games, seeking the ‘most plausible answer’ rather than perfect calculation.
  2. The transformer revolution: The transformer architecture overcame sequential processing limits, maximizing parallelism and contextual understanding, ushering in the LLM era.
  3. Alignment with humans and the future: ChatGPT was aligned with human intent via RLHF, and now AI is evolving toward ‘agent AI’ that plans and acts autonomously.

The next chapter—how we develop and responsibly integrate this powerful new technology into society—is in all our hands. Why not experience one of the domestic LLMs introduced today to feel their potential and limitations firsthand?

References
  • Monte Carlo Method - Namu Wiki
  • Dartmouth Workshop - Wikipedia
  • AI Winter - Wikipedia
  • What is Backpropagation? - IBM
  • ImageNet - Wikipedia
  • What is a Recurrent Neural Network (RNN)? - AWS
  • [1706.03762] Attention Is All You Need - arXiv
  • The Illustrated Transformer - Jay Alammar
  • Illustrating Reinforcement Learning from Human Feedback (RLHF) - Hugging Face
  • Status and Comparison of Domestic LLM Models - MSAP.ai
#Large Language Model #ChatGPT #Artificial Intelligence #Deep Learning #Transformer #LLM
