Language Models From Scratch: From Tokens to Local LLMs

Last updated: 02/09/2026
  • Large language models predict tokens using transformers and attention over huge text corpora, not symbolic databases.
  • Tokenizer design, parameter count, context window and temperature define how capable and creative an LLM can be.
  • Open, closed and niche LLM ecosystems plus quantization make it possible to run powerful models on consumer hardware.
  • LLMs unlock search, coding and analytics use cases, but bring challenges like hallucinations, bias, security and scaling.

Large language models from scratch

When you type on your phone and see the keyboard guessing the next word, you are getting a tiny glimpse of what a large language model (LLM) does. The difference is scale: instead of using just the last few characters or words, an LLM relies on patterns learned from an enormous portion of the text available on the internet, compressed into a giant neural network. If you ask it for the capital of Japan, it does not open a geographic database; it simply computes that, after the sequence of words you wrote, the token corresponding to “Tokyo” has an astronomically high probability of being the next output.

Understanding how these models work from the ground up is crucial if you want to build, choose, deploy or simply use them intelligently. In this guide we’ll unpack, in plain English, the full stack behind modern LLMs: tokens, transformers, parameters, context windows, temperature, tokenizer design, open vs closed ecosystems, quantization, hardware trade‑offs, training, fine‑tuning and real‑world limitations and benefits, along with pointers to open-source language model evaluation platforms. The goal is to demystify the jargon so you can reason about language models like a practitioner instead of treating them as black magic.

From words to tokens: how LLMs really read text

Despite how natural their responses look, LLMs do not operate on letters or full words the way humans do; they operate on tokens. A token is a small unit of text defined by a tokenizer: it might be a complete short word like “cat”, a subword prefix like “un‑”, a suffix, punctuation, or even a space character. The exact segmentation depends on how the tokenizer’s vocabulary was built.

This token-based view explains many seemingly weird behaviors of language models. Consider the classic question “How many ‘r’ letters are there in ‘strawberry’?”. Many models will answer 2 rather than the correct 3, not because they cannot count, but because internally they may see the word as two atomic tokens like “straw” + “berry”. At that level, individual letters are invisible. Unless you explicitly force the model to spell the word out character by character, it cannot reliably count the “r”s because each token is treated as an indivisible symbol.
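To make this concrete, here is a toy sketch in Python. The two-token vocabulary and the greedy segmentation are invented purely for illustration and do not correspond to any real tokenizer, but they show why letters inside a token are invisible to the model:

```python
# Toy illustration (not a real tokenizer): a model that sees subword
# tokens cannot inspect the letters inside them.
toy_vocab = {"straw": 0, "berry": 1}  # hypothetical segmentation

def toy_tokenize(word: str) -> list:
    """Greedy longest-match segmentation over the toy vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in toy_vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to single characters
            i += 1
    return tokens

tokens = toy_tokenize("strawberry")
print(tokens)                   # ['straw', 'berry'] -- the letters are hidden
print("strawberry".count("r"))  # 3 -- character-level counting sees them all
```

At the token level the model receives two opaque symbols; only by forcing a character-by-character spelling does the letter count become visible.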

Tokenization quality has a surprisingly strong effect on how truthful and data‑efficient a model can be. Research such as the TokenMonster experiments, where 16 models from roughly 90M to 354M parameters were trained from scratch with different vocabularies, shows that careful tokenizer design outperforms older schemes like the GPT‑2 tokenizer or tiktoken’s p50k_base on multiple benchmarks. In these experiments, more efficient tokenizers improved factual accuracy on QA benchmarks (like SMLQA and SQuAD) without necessarily making the text more “fluent” or eloquent.

One key insight is that validation loss and F1 score can become misleading when you compare models built with different tokenizers. Validation loss tends to correlate extremely strongly with compression ratio (average characters per token). If a tokenizer packs more characters into each token, the loss per token naturally looks different, even if the underlying language modeling quality is similar. A more sensible comparison is loss per character. Likewise, the F1 score heavily penalizes longer answers, so models that give more detailed responses can look worse by F1 even when they are more helpful in practice.
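The per-character normalization described above is a one-line computation. The numbers below are hypothetical, chosen only to show how a model with a higher per-token loss can still be the better language model once compression is accounted for:

```python
def loss_per_char(loss_per_token: float, chars_per_token: float) -> float:
    """Normalize cross-entropy loss by characters instead of tokens."""
    return loss_per_token / chars_per_token

# Hypothetical models with different tokenizers: the second "looks" worse
# per token, but packs more characters into each token.
model_a = loss_per_char(2.80, 4.0)   # 0.70 nats per character
model_b = loss_per_char(3.00, 5.0)   # 0.60 nats per character
print(model_a > model_b)             # True: model B models text more efficiently
```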

The transformer engine and the magic of attention

Under the hood, modern LLMs are based almost exclusively on the transformer architecture introduced in 2017. The “T” in names like GPT stands for “Transformer”. This design replaced earlier recurrent and convolutional architectures because it scales far better and captures long‑range dependencies in text much more effectively.

The core innovation of transformers is the self‑attention mechanism, which lets the model look at all tokens in a sequence at once. Earlier models processed text strictly left‑to‑right and tended to “forget” the beginning of long sentences by the time they reached the end. In contrast, self‑attention assigns a learned weight to every pair of tokens, so the model can directly connect, say, the subject of a sentence with a verb many words later.

To make this work numerically, each token is first mapped to a dense vector, called an embedding. Embeddings are learned representations that place semantically related items close together in vector space. In an essay about dogs, the vectors for “bark” and “dog” will end up much closer than “bark” and “tree”, because the model has seen them co‑occur in similar contexts during training. Transformers also add positional encodings so each token knows its relative position in the sequence.

In each attention layer, every embedding is projected into three different vectors: query (Q), key (K) and value (V). Intuitively, the query expresses what the current token is “looking for” in other tokens, the key represents what each token “offers” to the others, and the value is the actual information payload that gets mixed in. Attention scores are computed as similarity between queries and keys, then normalized into weights. These weights control how much of each value vector flows into the updated representation of the token.
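The Q/K/V mechanics can be sketched in a few lines of NumPy. This is a single attention head with random weights, a minimal illustration of the computation rather than a faithful reproduction of any production model (real transformers add multi-head projections, masking, residual connections and normalization):

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a token sequence x."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv          # project into query/key/value
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # similarity of every query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights
    return weights @ V                        # mix value vectors by those weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))       # 4 token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)                              # (4, 8): one updated vector per token
```

Every token ends up with an updated representation that is a weighted blend of all the value vectors in the sequence, with the weights determined by query-key similarity.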

Stacking many self‑attention and feed‑forward layers produces rich contextual representations that encode grammar, facts and reasoning patterns. Transformers support heavy parallelization, which made it feasible to train on massive text corpora. Over time, the billions of learned parameters—essentially the network’s internal weights—encode everything from syntactic rules to world knowledge and even abstract problem‑solving strategies.

Parameters, context window and temperature: the LLM glossary

Whenever you browse AI platforms or model repositories, you will run into cryptic strings like “70B”, “8B-Instruct” or “temp=0.8”. These are not nuclear codes; they are simply shorthand for key properties that define how an LLM behaves and what hardware it needs. Understanding them will save you a lot of confusion and poor configuration choices.

Parameters are the rough analog of neurons or synapses in biological brains. They are the numerical weights that the training process adjusts to minimize prediction error. A model with 7 billion parameters (7B) has far less representational capacity than one with 400B+, just like a tiny neural network has less flexibility than a huge one. Typical informal ranges look like this:

  • 7B-9B: smaller models such as Llama‑3 8B or Gemma‑2 9B. They are light enough to run on a decent consumer PC, but if you push them into complex reasoning or niche knowledge, they are more prone to “hallucinate”—that is, produce plausible‑sounding but incorrect text.
  • 70B: mid‑sized giants like Llama‑3 70B. Here you get a strong balance between depth of reasoning and practical usability. They often require powerful GPUs or cloud deployment and can reach or exceed expert‑level performance in many tasks.
  • 400B and beyond: ultra‑large frontier models such as hypothetical GPT‑5‑class or high‑end Gemini variants. These provide enormous breadth of knowledge and reasoning, but are effectively impossible to run locally; they live in data centers and are served over APIs.

More parameters do not automatically mean “better answers” in every scenario. Larger models tend to have more robust reasoning, but quality also depends on data, training recipes, tokenizer efficiency, and fine‑tuning. Think of parameter count more as potential cognitive capacity than as an absolute quality score.

The context window is the model’s short‑term memory: how many tokens it can consider at once. Early LLMs often had context windows around 4,000 tokens, roughly equivalent to ~3,000 words of English. Modern systems can handle hundreds of thousands or even millions of tokens. That means you can feed them an entire book, multiple technical manuals and a codebase, then ask questions that rely on all of it without the model “forgetting” the earlier parts of the input.

Temperature controls the trade‑off between determinism and creativity in the sampling step. With a temperature of 0.0, the model always chooses the single most probable next token, which is ideal for code generation, math or structured data extraction where consistency matters. At temperatures around 0.8-1.0, the sampler explores less probable tokens more often, which can produce more original or surprising outputs—useful for brainstorming, storytelling or poetic writing. Pushing the temperature too high (for example above 1.5) makes the model’s output unstable and often incoherent, like a person rambling without filter.
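The sampling step described above is simple enough to sketch directly. The logits here are made-up next-token scores; the point is the mechanics of dividing by temperature before the softmax:

```python
import numpy as np

def sample_next_token(logits, temperature, rng=None):
    """Sample a token index from logits softened or sharpened by temperature."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0.0:
        return int(np.argmax(logits))        # greedy: always the top token
    scaled = logits / temperature            # low T sharpens, high T flattens
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(logits), p=probs))

logits = [2.0, 1.0, 0.1]                     # hypothetical next-token scores
print(sample_next_token(logits, 0.0))        # 0: deterministic at temperature 0
```

At temperature 0 the call is fully deterministic; at 0.8 or 1.0 the lower-scoring tokens get sampled a meaningful fraction of the time, which is where the "creativity" comes from.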

Tokenizer design and why it matters for truthfulness

Although tokenization sounds like an implementation detail, it strongly shapes how efficiently a model learns and how accurately it recalls facts. Experiments with TokenMonster vocabularies show that, for comparable models, custom tokenizers can beat standard GPT‑2 or tiktoken vocabularies across benchmarks, even without changing the architecture.

A key result from those studies is that an intermediate vocabulary size around 32,000 tokens often works best. Smaller vocabularies have simpler structure and can converge faster during training, but they may force the model to break words into many sub‑tokens, which increases sequence length and training cost. Very large vocabularies can overfit rare patterns and make training less stable, without a corresponding gain in final quality.

Interestingly, higher compression—more characters per token—does not inherently hurt model quality. What matters more are quirks or defects in the tokenizer that make certain patterns hard to represent. Multi‑word tokens, for example, can achieve great compression but may cause a measurable drop (around 5% in some tests) on factual QA benchmarks like SMLQA, even though the character‑per‑token ratio improves by ~13%.

The research also highlights that tokenizers primarily influence the model’s ability to store and retrieve factual information, not its surface fluency. Because grammatical patterns are easier to fix during backpropagation than fragile factual associations, any wasted capacity or inefficiency at the token level tends to degrade truthfulness first. The net takeaway is simple: a better tokenizer yields a more reliable model, even if the prose style looks similar.

Types of LLMs: closed, open, open‑source and niche

The AI ecosystem has split into several camps based on how models are distributed and what you are allowed to do with them. Understanding these categories helps you pick the right tool and avoid unexpected legal or privacy headaches.

Closed or proprietary models are the big commercial names most people know. Think of large GPT releases, Gemini, Claude and similar offerings. Their advantages are obvious: cutting‑edge performance, huge context windows, advanced reasoning, multimodal capabilities and heavily optimized serving infrastructure. The flip side is that you never actually “own” these models; your prompts and data go to a third‑party server, your usage is governed by their policies and pricing, and safety filters can block or reshape answers in ways you cannot fully control.

Open‑weight models (often incorrectly called “open source” LLMs) take a middle path. Companies and research labs release the trained weights so you can download and run the models locally or on your own servers, but they usually keep the training code, hyperparameters and raw datasets proprietary. Families like Llama‑3, Mistral and Qwen are emblematic of this approach. Once the weights are on your machine, you can run them offline, protect your data, customize them and bypass censorship—subject, of course, to license terms.

Fully open‑source models go further by publishing not only the weights but also the training code and datasets. Projects such as OLMo from the Allen Institute fall into this category and are especially valuable for rigorous scientific research and reproducibility. You can audit exactly how the model was built, re‑train variants, or adapt the recipe to your own domain.

Niche or domain‑specific models trade breadth for depth in a particular area. These are smaller LLMs, often up to ten times lighter than general‑purpose giants, tuned for specialties like medicine, law or software engineering. Within their niche, they can outperform much larger generic LLMs because all of their capacity is focused on one slice of knowledge. They are also easier to deploy on modest hardware, which makes them attractive for companies that need strong performance on a narrow set of tasks.

Reading a model name like a pro

Model repositories such as Hugging Face are full of names that look like random alphabet soup. Once you know how to parse them, those names encode almost everything you need: size, purpose, format and how aggressively the weights have been compressed.

Consider this example: “Llama-3-70b-Instruct-v1-GGUF-q4_k_m”. Each piece has a specific meaning:

  1. Llama‑3: the model family and architecture, in this case Meta’s Llama‑3 line.
  2. 70b: about 70 billion parameters. This size immediately tells you that you will need serious hardware—think large‑VRAM GPU setups or a high‑end Apple machine.
  3. Instruct: indicates the model was fine‑tuned to follow natural language instructions and converse with humans. If you want a general assistant, always look for “Instruct” or “Chat” variants; raw base models may respond as if they are simply continuing a list or sequence instead of answering your question.
  4. GGUF: the file format. GGUF is optimized for running on CPUs and Apple silicon and is used by tools like LM Studio. Other common formats include EXL2, GPTQ or AWQ for GPU‑centric deployments (typically NVIDIA), and “safetensors” for raw weights that might need extra conversion.
  5. q4_k_m: a quantization tag explaining how the weights were compressed. The “4” means 4‑bit precision, a medium‑quality compromise; “k_m” refers to a particular K‑quants method that tries to shrink less important neurons more aggressively while preserving critical ones.

Being able to decode these labels lets you immediately gauge whether a model fits your hardware and use case. You can tell at a glance if it is chat‑oriented, roughly how smart it is, whether it is CPU‑friendly or GPU‑optimized, and how much accuracy you may have traded away via quantization.
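Because these naming conventions are informal and vary between uploaders, any parser is necessarily heuristic. The following sketch pulls the fields discussed above out of a name like the example; the patterns are assumptions about common community conventions, not a standard:

```python
import re

def parse_model_name(name: str) -> dict:
    """Heuristic parse of a community model filename (conventions vary)."""
    info = {}
    m = re.search(r"(\d+)b", name, re.IGNORECASE)
    info["params_billion"] = int(m.group(1)) if m else None
    info["chat_tuned"] = bool(re.search(r"instruct|chat", name, re.IGNORECASE))
    fmt = re.search(r"gguf|gptq|awq|exl2", name, re.IGNORECASE)
    info["format"] = fmt.group(0).upper() if fmt else "unknown"
    q = re.search(r"q\d_k_[sml]|q\d_0|iq\d", name, re.IGNORECASE)
    info["quant"] = q.group(0) if q else None
    return info

print(parse_model_name("Llama-3-70b-Instruct-v1-GGUF-q4_k_m"))
# {'params_billion': 70, 'chat_tuned': True, 'format': 'GGUF', 'quant': 'q4_k_m'}
```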

Quantization: compressing giant brains to fit real hardware

State‑of‑the‑art LLMs in full precision can be absurdly large—hundreds of gigabytes of raw weights. A 70B‑parameter model in standard 16‑bit floating‑point (FP16) precision can easily exceed 140 GB, which is far beyond what a single consumer GPU can handle. This is where quantization comes in as the key technique that makes local deployment practical.

Conceptually, quantization means using fewer bits to store each weight, at the cost of some numerical precision. Instead of storing a value like 0.123456 with many decimal places, you might store something like 0.12 in a compact representation. In FP16 you have 16 bits per weight; a 4‑bit scheme uses only a quarter of that storage. The surprise from recent research (including studies from 2025) is that for many conversational and summarization tasks, going from 16 bits down to 4 bits causes only a modest drop in perceived intelligence.
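A minimal sketch makes the idea tangible. This is naive symmetric round-to-nearest quantization to a signed 4-bit range; real schemes like the K-quants are considerably more sophisticated (per-block scales, importance weighting), but the core trade of precision for storage is the same:

```python
import numpy as np

def quantize_4bit(weights):
    """Toy symmetric round-to-nearest quantization to 4-bit signed integers."""
    scale = np.abs(weights).max() / 7.0        # signed 4-bit range: -8..7
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.123456, -0.5, 0.98, -1.2], dtype=np.float32)
q, scale = quantize_4bit(w)
w_hat = dequantize(q, scale)
print(q)        # small integers in [-8, 7]
print(w_hat)    # approximate reconstruction of the original weights
```

Each weight now needs 4 bits plus a shared scale factor instead of 16 bits, and the reconstruction error stays within half a quantization step.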

Different quantization levels and methods target different hardware constraints and quality trade‑offs. A popular configuration for general users is Q4_K_M. “Q4” denotes 4 bits per weight and “K_M” indicates an advanced strategy that preferentially compresses less salient neurons. This can shrink a model by roughly 70% while retaining around 98% of its reasoning ability for everyday chat, explanation and content generation.

Pushing compression too far can effectively lobotomize the model. Q2 or IQ2 schemes, which reduce weights to 2 bits, make it possible to load huge models onto very limited GPUs, but the cost is high: frequent loops, repetitive phrases, lost logical structure and severe degradation on math or code tasks. They may still be fun to experiment with but are rarely suitable for serious work.

Quantization hits pure reasoning harder than surface writing quality. The 2025 paper “Quantization Hurts Reasoning?” found that although a quantized model can still produce fluent prose, it loses more ground on logic‑heavy benchmarks such as mathematics and advanced programming. If your main needs involve rigorous reasoning, physics problems or production‑grade code, you should use the highest precision your hardware comfortably supports—often Q6 or Q8 for local setups.

A handy rule of thumb helps estimate whether a given GPU can host a quantized model. Multiply the number of billions of parameters by about 0.7 GB to get a rough VRAM requirement for a Q4 model. For instance, an 8B model at Q4 will need about 5.6 GB of VRAM (8 × 0.7), which fits nicely on many mid‑range GPUs. A 70B model at Q4, by contrast, wants around 49 GB of VRAM, which is beyond a single consumer GPU; you would need multiple high‑end cards or a specialized server.
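The rule of thumb reduces to a one-line estimator (remember it is a rough heuristic for Q4 weights only, and leaves out the extra memory needed for the KV cache and activations):

```python
def vram_gb_q4(params_billion: float) -> float:
    """Rough VRAM estimate for a Q4-quantized model: ~0.7 GB per billion params."""
    return params_billion * 0.7

print(vram_gb_q4(8))    # 5.6  -> fits on many mid-range GPUs
print(vram_gb_q4(70))   # 49.0 -> beyond a single consumer GPU
```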

Running LLMs locally: NVIDIA vs Apple paths

Running a serious LLM on your own machine can feel like a hardware puzzle, and the ecosystem has coalesced around two main hardware philosophies. One pathway leans on NVIDIA GPUs and CUDA for raw speed; the other takes advantage of Apple’s unified memory architecture for sheer capacity.

On the NVIDIA side, RTX 3000, 4000 and 5000 series GPUs are the undisputed leaders in throughput. CUDA‑accelerated inference can generate tokens faster than you can read them, especially for smaller models in the 7B-13B range. If your priority is snappy interactivity—say, for coding agents or real‑time assistants—this is extremely compelling. The downside is that VRAM is expensive and capped: a flagship RTX 4090 still “only” offers 24 GB, which restricts you to around 30-35B parameters at comfortable quantization levels. Scaling to a full 70B model may require multiple cards or professional‑grade hardware.

Apple’s route centers on Macs with M‑series chips and large unified memory pools. In these systems, the same memory serves as both RAM and VRAM, which means a Mac Studio with 192 GB of unified memory can host gigantic quantized models that most consumer GPUs can only dream of. Users have reported running models like Llama‑3.1 405B (heavily quantized) or DeepSeek 67B directly on such machines. Throughput is slower than top‑tier NVIDIA cards—text is generated at a human‑readable pace rather than instant bursts—but for researchers and developers who value raw model capacity over speed, this is often the most accessible way to run “GPT‑4‑class” systems locally.

Both ecosystems are supported by user‑friendly tools that make local LLMs approachable. Two of the most popular are LM Studio and Ollama. LM Studio offers a polished graphical interface similar to ChatGPT, with integrated model search (via Hugging Face), one‑click downloads and sliders for adjusting context size, temperature, GPU vs CPU load and more. Ollama, widely favored by developers, provides both a simple GUI and powerful command‑line control, making it easy to connect local models to editors, note‑taking tools and custom apps via APIs.

The key benefit of local deployment is control: your prompts and documents never leave your machine, and no external service can silently throttle or block content. You gain privacy, reproducibility and often lower marginal cost—especially if you are running large workloads that would be expensive via hosted APIs.

From pretraining to fine‑tuning and prompting

Every LLM goes through at least two conceptual phases before you ever send it a single prompt: pretraining and adaptation. Pretraining is where the model learns general language patterns; adaptation (fine‑tuning or prompt tuning) is how it becomes useful for specific tasks.

During pretraining, the model ingests huge text corpora, often including sources like Wikipedia, books, web pages and public code repositories. It performs unsupervised learning by repeatedly trying to predict the next token in a sequence and measuring its error via a loss function. Using backpropagation and gradient descent, it adjusts billions of weights to lower that loss. Over trillions of tokens, it gradually internalizes grammar, semantics, world facts, coding idioms and basic reasoning templates.
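The pretraining objective itself is just cross-entropy on the next token. Here is the loss for a single prediction step over a hypothetical five-token vocabulary, with made-up logits; real training averages this over billions of positions and backpropagates through the whole network:

```python
import numpy as np

def next_token_loss(logits, target):
    """Cross-entropy loss for predicting one next token (the pretraining objective)."""
    probs = np.exp(logits - logits.max())    # numerically stable softmax
    probs /= probs.sum()
    return -np.log(probs[target])

# Hypothetical 5-token vocabulary; the correct next token is index 2.
logits = np.array([0.5, 0.1, 2.0, -1.0, 0.3])
print(next_token_loss(logits, 2))   # low-ish loss: the model already favors token 2
```

Lowering this quantity across trillions of tokens is, mechanically, all that pretraining does; grammar, facts and reasoning templates emerge as side effects.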

Fine‑tuning specializes the pretrained model for a narrower activity. For instance, you can fine‑tune an LLM on parallel corpora for translation, or on labeled sentiment analysis examples, or on legal documents annotated with the correct responses. The model continues training on these task‑specific datasets, slightly modifying its parameters so that it performs better on that niche without entirely forgetting its broad capabilities.

Prompt‑based adaptation (few‑shot and zero‑shot prompting) offers a lighter‑weight alternative to fine‑tuning. In a few‑shot setup, you embed small tables or examples directly into the prompt—for example, a couple of customer reviews labeled as positive or negative—then ask the model to classify new reviews in the same style. In a zero‑shot regime, you simply describe the task in natural language (“The sentiment of ‘This plant is horrible’ is …”) and rely on the model’s prior training to figure out what to do. Modern LLMs can often perform surprisingly well in zero‑shot mode, thanks to their “in‑context learning” abilities.
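Few-shot prompting is ultimately just string construction. The helper below, with invented example reviews, shows the typical layout: a task description, labeled examples, then the unlabeled query for the model to complete:

```python
def few_shot_prompt(examples, query):
    """Build a few-shot sentiment-classification prompt from labeled examples."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

examples = [("I loved this blender!", "positive"),
            ("Broke after two days.", "negative")]
print(few_shot_prompt(examples, "This plant is horrible"))
```

The prompt deliberately ends right after “Sentiment:”, so the model's natural next-token prediction is the label itself. Dropping the examples and keeping only the instruction turns the same prompt into a zero-shot one.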

Core components inside a large language model

Architecturally, LLMs are deep stacks of relatively simple building blocks that repeat many times. Understanding the major pieces clarifies what can be customized or swapped when you design or choose a model.

The embedding layer maps discrete tokens to continuous vectors. Each token index from the vocabulary is turned into a dense vector that encodes both semantic and syntactic information. These embeddings move through the network and get progressively refined by attention and feed‑forward layers.

The attention mechanism is the heart of the transformer. As described earlier, self‑attention lets each token weigh all others according to learned criteria, enabling the capture of long‑distance dependencies and contextual cues. Multi‑head attention extends this by allowing several different “views” or subspaces to attend in parallel, which enriches the representations.

The feed‑forward or “MLP” layers apply non‑linear transformations to the attended representations. After attention distills what each token should care about, the feed‑forward layers mix and reshape that information through fully connected layers and activation functions. Stacking many such blocks builds up complex hierarchical features.

By adjusting how these components are combined and scaled, you get different kinds of models. Plain “base” models just predict the next token; instruction‑tuned models learn to follow natural language directives; dialogue‑tuned models are optimized to keep multi‑turn conversations coherent and helpful.

LLMs vs. generative AI at large

It’s easy to confuse “large language models” with “generative AI”, but the latter is a broader umbrella term. Generative AI encompasses any system that can generate content—text, images, audio, video or code. LLMs are specifically text‑focused generative models, trained on language data and optimized to produce or transform textual content.

Many famous tools sit outside the LLM category even though they are generative. Image generators like DALL‑E or MidJourney create pictures rather than paragraphs. Music models, video synthesis systems and protein‑structure generators are also generative AI, but they operate in very different input and output spaces. The main shared idea is that all of them learn to map from some representation (often a prompt) to realistic outputs in their domain.

Real‑world use cases: where LLMs shine

Thanks to their flexible text understanding and generation abilities, LLMs have become core engines for a wide range of applications. Many of these were once separate subfields of NLP but now share a common foundation model.

Search and information retrieval is one of the most visible beneficiaries. Search engines can augment traditional keyword‑based indexing with semantic retrieval and LLM‑generated answers, yielding concise summaries or conversational answers instead of just a list of links. Tools like Elasticsearch Relevance Engine (ESRE) let developers combine transformer models with vector search and distributed search architectures to build their own domain‑specific semantic search experiences.
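At the core of such semantic retrieval is nearest-neighbor search over embeddings. This toy sketch uses 3-dimensional made-up vectors where real systems use model-produced embeddings with hundreds or thousands of dimensions, but the ranking logic, cosine similarity between query and documents, is the same:

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=2):
    """Rank documents by cosine similarity to a query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    return np.argsort(-sims)[:k]            # indices of the k most similar docs

# Toy 3-d "embeddings"; axes loosely stand in for topics.
docs = np.array([[0.9, 0.1, 0.0],     # doc 0: about dogs
                 [0.0, 0.9, 0.1],     # doc 1: about finance
                 [0.8, 0.2, 0.1]])    # doc 2: also about dogs
query = np.array([1.0, 0.0, 0.0])     # query embedding near the "dogs" region
print(cosine_top_k(query, docs))      # the two dog documents rank first
```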

Text analytics and sentiment analysis are natural fits as well. Companies deploy LLMs to digest customer reviews, social media posts and support tickets, automatically tagging sentiment, urgency and themes. Prompt‑based or fine‑tuned classifiers can replace older machine‑learning pipelines with simpler, more adaptable setups.

Content and code generation are perhaps the most popular everyday uses. From drafting emails and marketing copy to producing poetry “in the style of” specific authors, LLMs can generate coherent, contextually appropriate text at scale. Similarly, code‑oriented models assist developers by suggesting completions, writing boilerplate, explaining snippets, or even generating entire functions from natural language descriptions, as shown by an LLM learning SwiftUI through automated feedback.

Conversational agents and chatbots are almost always powered by some form of LLM today; building them often requires careful orchestration—see design and construction of AI agent teams. In customer service, healthcare triage, personal productivity and education, conversational models interpret user intent and respond in a way that approximates human dialogue. They can remember prior messages within the context window, follow instructions and adapt tone and style.

These capabilities are impacting many industries simultaneously. In technology, LLMs speed up coding and debugging; in healthcare and life sciences, they help analyze research papers, clinical notes and even biological sequences; in marketing, they support campaign ideation and copywriting; in legal and finance, they assist with document drafting, summarization and pattern detection; in banking and security, they help spot potentially fraudulent behavior in text‑rich logs and messages.

Limits, risks and open challenges

Despite their impressive abilities, LLMs are not omniscient or infallible, and treating them as such can be dangerous. They inherit many weaknesses from their data and architecture, and new ones emerge from how we deploy them.

Hallucinations—confidently stated falsehoods—remain a major concern. Because an LLM is ultimately a next‑token predictor trained on patterns, not on grounded truth, it may fabricate plausible‑sounding details, sources or experiences. It might “explain” an API that does not exist or assert legal facts that are simply wrong. Guardrails, retrieval‑augmented generation (RAG) and human review are crucial in high‑stakes settings.

Security and privacy risks are also significant. Poorly managed models can leak sensitive training data or confidential prompts, and attackers can abuse LLMs for phishing, social engineering, spam or disinformation campaigns. Prompt‑injection attacks and data exfiltration through model outputs are active research topics.

Bias and fairness problems are deeply tied to the composition of training data—read about the LLM dependency trap. If corpora over‑represent particular demographics or viewpoints, the model will amplify those biases in its outputs, potentially marginalizing other groups or perspectives. Careful dataset curation, bias evaluation and mitigation strategies are necessary but still imperfect.

Consent and intellectual‑property issues loom large as well. Many large training datasets were assembled by scraping public content without explicit permission from authors, raising questions about copyright, data protection and ethical use. Lawsuits over unlicensed use of images or texts have already reached the courts, and regulations are evolving quickly in this area.

Finally, scaling and deployment are resource‑intensive. Training and serving frontier‑scale LLMs demand specialized hardware, distributed systems expertise, continuous monitoring and substantial energy consumption. Even for smaller models, managing latency, cost and reliability at production scale is non‑trivial.

When you put all these pieces together—tokens and tokenizers, transformers and attention, parameters and context, quantization and hardware, training and deployment—you get a clear picture of LLMs as powerful pattern learners rather than magical oracles. With the right tokenizer, architecture, compression strategy and hardware setup, you can run surprisingly capable models locally, tailor them to your domain and integrate them into search, analytics, content creation or conversational workflows, all while staying aware of their limits around truthfulness, bias, security and legal constraints.

Related article: How to Host Language Models on a Low Budget