5 Facts on Real World DevSecOps in 2026 (Sponsored)Datadog analyzed tens of thousands of production applications to reveal where risk is actually showing up and what it means for teams handling security today. Download the full report to dive into the key findings, including:
In December 2024, an AI lab called DeepSeek released a 671-billion-parameter language model along with a technical report describing exactly how they built it. Six months later, a different team called Moonshot AI used that report as a starting point. They scaled the design to a trillion parameters, ran into a training instability problem that emerged at that scale, invented a new optimizer to solve it, and shipped their own model. Eight months after that, a third team called Zhipu AI integrated a different DeepSeek innovation into their architecture and contributed a new training framework of their own. These three teams work for different organizations. However, they were indirectly collaborating in public, through model releases where each company was learning from what its predecessors had done before. This has been made possible by the rise of open-weight models, where even competitors get to learn from each other. The pace of that kind of collaboration has changed over the past eighteen months, and the reasons trace back to the architecture and training choices these teams made in the open. In this article, we will look at how open-weight models have transformed the AI landscape. Disclaimer: This post is based on publicly shared details from various sources. Please comment if you notice any inaccuracies. Open Weight vs Closed WeightEvery modern large language model has two important things behind it:
A closed-weight model is one that the company keeps behind an API. The user sends some text to the official API endpoint, the company’s servers run the model on their hardware, and a response comes back. The parameters stay with the organization, and running the model on personal hardware or adjusting it for a specific task remains out of reach. An open-weight model is one where the company has published the trained parameters. Anyone can download them, run the model on their own hardware, and adjust it for their own data. The training data and the full training code, however, usually stay private. The term is “open weight” rather than “open source” for this reason. In traditional software, “open source” means the full source code is available to inspect and reproduce. Most AI models marketed as open source are actually open weight, where the trained model is public, while the process that produced it remains closed. This distinction matters because the published weights, paired with detailed technical reports, are what allow other teams to study a design and build on top of it. Different open-weight models also use different licenses, ranging from very permissive ones like MIT and Apache 2.0 to custom licenses with specific commercial restrictions, so the practical freedoms vary across the ecosystem. See the diagram below that shows the difference between accessing a closed-weight model and an open-weight model: The MoE ArchitectureEvery major open-weight LLM released at the frontier in 2025 and 2026 shares the same architectural skeleton. It is called a Mixture-of-Experts transformer, or MoE for short. Modern LLMs are built from stacked “blocks.” Each block has two main parts, an attention layer that figures out which previous words matter for the next one, and a feed-forward layer that does the actual computation. In a regular (”dense”) model, every parameter activates for every word the model processes. Adding more parameters to make a smarter model means the cost of running it scales linearly with that count. With hundreds of billions of parameters, this becomes impractical. MoE solves this by replacing the single feed-forward layer in each block with several smaller “expert” sub-networks, plus a small routing component that picks which experts to use for each word. The model can store knowledge across many experts while only computing a few of them per word. This is why two numbers matter for every MoE model:
DeepSeek V3, for example, has 671 billion total parameters but only 37 billion active per word. Kimi K2 has a trillion total, but 32 billion active. Qwen3 has 235 billion total and 22 billion active. When comparing the cost of running these models, the active counts are what matter, rather than the totals. A trillion-parameter model and a 235-billion-parameter model can cost roughly the same per query if their active counts are similar. See the diagram below that shows how an MoE block works: |