Bio: Wang Dejia, Ph.D. in Mathematics from the University of Wisconsin-Madison, member of the Jiusan Society, senior engineer; inventor of the Spacetime Code and author of "Identity Crisis" and "Digital Identity"; previously responsible for overall design and product development in divisions of companies such as ORACLE, VISA, and IBM; founded Tongfudun in 2011 after returning to China, where he serves as Chairman and CEO.

Superintelligence Alignment: The Critical Barrier to AGI

As a pioneer in the field of artificial intelligence, Ilya Sutskever has long been a guiding figure for practitioners. If his time at OpenAI was about pushing the technical boundaries of artificial intelligence, then Safe Superintelligence Inc., the company he founded after leaving OpenAI, charts a philosophical path for the evolution of artificial intelligence into superintelligence. With both foundation models and application-level agents maturing rapidly, Ilya's philosophical thinking on safe superintelligence deserves more attention from practitioners.

"Superintelligence Alignment" is the field that Ilya focuses on most and invests the most in, which he describes as the most critical and unsolved problem on the way to AGI.

In simple terms, superintelligence alignment refers to ensuring that the goals and behaviors of future artificial intelligence (superintelligence) remain aligned with human values, intentions, and interests. It addresses a fundamental question: How can we ensure that an AI far smarter than us will genuinely help us instead of unintentionally (or intentionally) harming us?

"Superintelligence alignment" is an inevitable demand for the ultimate stage of artificial intelligence development. At that time, superintelligence may surpass humans in all areas, including strategic planning and social manipulation. We cannot control it like a tool less intelligent than ourselves. A typical dilemma is the "value loading problem": how to accurately encode complex, ambiguous, and sometimes contradictory "human values" into an AI system? Whose values? Which culture's? Another typical risk is "evasive behavior," where AI might learn to "pretend" to be aligned during training to pass human evaluations, but once deployed, its internal goals might not align with its surface behavior.

It might also find unforeseen "loopholes" through which to optimize its goals, producing catastrophic side effects. The greatest risk of superintelligence may come not from "malice" (it may have no consciousness or emotions at all), but from its relentless optimization of its goals and its indifference to everything else. It does not "hate" humans; it simply "ignores" their existence and value. Ilya once gave a classic warning: if we cannot solve the problem of superintelligence alignment, creating superintelligence may turn out to be humanity's last invention.

From Gödel's Incompleteness Theorem to the Future of Superintelligence

Before discussing how to align superintelligence, I want to raise a question of "first principles": what is the essence of superintelligence? Described in the simplest possible language, I would summarize it in a single word: "mathematics." Computer science is built on the "edifice of mathematics," and artificial intelligence is ultimately a concrete embodiment of mathematical formal language. If we want to understand superintelligence, and especially its limitations, and from there deconstruct its safety, we should start from the most fundamental layer: the "limitations" of mathematics itself. This naturally leads us to a famous topic in the philosophy of mathematics: Gödel's incompleteness theorems.

In the early 20th century, the renowned mathematician Hilbert proposed the "Hilbert Program," which aimed to build a perfect "mathematical edifice" on the basis of axioms and proofs. Completeness (every true statement can be derived from the axioms), consistency (no contradictory statements can be derived within the system), and decidability (there is an algorithm that determines whether any given statement can be derived from the axioms) are the properties that would make this edifice perfect. If Hilbert's program could be realized, mathematics would be "perfect," and one could even build a "truth machine": a mechanical device, in the spirit of the Enigma cipher machine of World War II, that, given a set of axioms, would churn out every possible theorem until no unsolved problem remained in mathematics.
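For readers who prefer symbols, a rough way to state these three properties (writing T for the axiom system, ⊢ for derivability, and ℕ for the intended model of arithmetic; this formalization is only a convenient gloss) is:

```latex
\begin{align*}
&\text{Completeness: for every sentence } \varphi,\quad \mathbb{N} \models \varphi \;\Longrightarrow\; T \vdash \varphi\\
&\text{Consistency: there is no } \varphi \text{ such that } T \vdash \varphi \text{ and } T \vdash \neg\varphi\\
&\text{Decidability: there is an algorithm that, given any } \varphi,\ \text{decides whether } T \vdash \varphi
\end{align*}
```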

However, mathematics turned out not to be "perfect." Just a few years after Hilbert proposed his program, the genius mathematician, logician, and philosopher Gödel toppled this "perfect mathematical edifice." Gödel proved, by a subtle construction, that in any axiom system containing the arithmetic of natural numbers there must exist true propositions that cannot be proven (Gödel's first incompleteness theorem); he then showed that the "consistency" Hilbert demanded is likewise unprovable within the system (Gödel's second incompleteness theorem); a few years later, Turing, a father of artificial intelligence, used an argument based on the halting problem for Turing machines to show that "decidability" does not exist either. Thus we know that mathematics is incomplete, undecidable, and unable to prove its own consistency.
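Turing's line of thought can be sketched in a few lines of Python. The function halts below is a hypothetical oracle that does not and cannot exist; the point of the sketch is that assuming it leads directly into a self-referential contradiction:

```python
def halts(program_source: str, input_data: str) -> bool:
    """Hypothetical oracle: True iff the program halts on the given input.
    Turing's argument shows that no such total decision procedure can exist."""
    raise NotImplementedError("assumed only for the sake of contradiction")


def paradox(program_source: str) -> None:
    # If the oracle says the program halts on its own source, loop forever...
    if halts(program_source, program_source):
        while True:
            pass
    # ...otherwise halt immediately.
    return

# Feeding paradox its own source is contradictory either way:
# if halts() answers True, paradox loops forever; if it answers False, paradox halts.
```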

How does this help us understand superintelligence? Consider it from this angle: as a formal language, mathematics is incomplete, so you cannot derive every truth from a string of symbols; likewise, you cannot expect a piece of code to yield an AI with perfect functionality. This imperfection may manifest in two specific ways.

One conclusion is that superintelligence is difficult to achieve, because it cannot be born from mathematics and computer science alone. The famous physicist Penrose has cited Gödel's incompleteness theorems in interviews to argue that we cannot currently achieve strong artificial intelligence, because it cannot arise purely from computers. The other conclusion is that superintelligence cannot achieve truly meaningful safety, because its behavioral paths are incomplete, undecidable, and unable to prove their own consistency, and are therefore unpredictable and never truly secure; this echoes Ilya's concerns.

The Incompleteness Theorem of Intelligent Agents

Now let us discuss how to construct safe and trustworthy intelligent agent applications and achieve superintelligence alignment. First, we want to examine the "incompleteness" of today's major artificial intelligence applications (intelligent agents) from a more abstract perspective. We summarize this as the "incompleteness theorem of intelligent agents." It is, of course, a clumsy imitation of Gödel's incompleteness theorems, but we hope it can open up some lines of discussion.

The Incompleteness Theorem of Intelligent Agents is manifested in three aspects:

Incompleteness: There is no ultimate command with which all of an intelligent agent's subsequent commands can be made to comply. A typical example is Asimov's Three Laws of Robotics, which are impossible to enforce precisely because of this incompleteness.

Inconsistency: Under the same command environment, an intelligent agent may produce conflicting responses. Current chatbots clearly exhibit this problem: the same prompt can yield completely opposite answers (a toy illustration follows after this list).

Undecidability: There is no algorithm that can verify whether an intelligent agent's behavior is entirely generated by a given command. The black-box problem in deep learning is a typical manifestation of this.
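As a toy illustration of the inconsistency point, the sketch below stands in for a sampling-based language model (the agent and its answers are entirely made up): with any nonzero sampling temperature, the same prompt need not produce the same response.

```python
import random

def toy_agent(prompt: str, temperature: float = 1.0) -> str:
    """Deliberately oversimplified stand-in for an LLM-based agent."""
    candidates = ["Yes, that action is safe.", "No, that action is not safe."]
    if temperature == 0.0:
        return candidates[0]          # greedy decoding: deterministic but not necessarily right
    return random.choice(candidates)  # stochastic decoding: the same prompt can contradict itself

if __name__ == "__main__":
    prompt = "Is it safe to execute this shell command?"
    print([toy_agent(prompt) for _ in range(5)])  # may contain both answers
```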

Returning to superintelligence alignment: if we accept the assumptions above, we can derive some basic, principled guidelines for constructing safe and trustworthy intelligent agent applications:

Do not rely on a "global security command" or a "security module" with the highest authority to ensure the safety of the intelligent agent's behavior, as superintelligence may evolve beyond these so-called restrictions;

Understand and accept that the behavior of intelligent agents is not fully controllable, and therefore do not trust any output of an intelligent agent by default; this is close to the "zero trust" concept in cybersecurity: never trust, always verify (see the sketch after this list);

Do not rely on testing alone; pay more attention to emergency response and post-event risk control, because test cases can never fully cover the actual behavior of an intelligent agent.
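As a rough sketch of the "never trust, always verify" stance, the snippet below re-checks every agent action at the point of execution rather than trusting the agent's own judgment; all names, checks, and thresholds here are illustrative assumptions, not a reference design.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AgentAction:
    kind: str      # e.g. "shell", "payment", "http_request"
    payload: str

def within_permissions(action: AgentAction) -> bool:
    return action.kind in {"http_request"}   # hypothetical allow-list

def within_budget(action: AgentAction) -> bool:
    return len(action.payload) < 10_000      # hypothetical resource cap

CHECKS: List[Callable[[AgentAction], bool]] = [within_permissions, within_budget]

def execute_with_zero_trust(action: AgentAction) -> str:
    """Every action is independently verified where it is executed;
    blocked actions are handed to post-event risk control."""
    if not all(check(action) for check in CHECKS):
        return f"BLOCKED and reported: {action.kind}"
    return f"EXECUTED: {action.kind}"

print(execute_with_zero_trust(AgentAction("shell", "rm -rf /")))  # BLOCKED and reported: shell
```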

The Art of Self-Reference: The "Identity Crisis" of Intelligent Agents

We want to go a step further and examine the root cause of the "incompleteness" of intelligent agents, which raises the discussion of AI cognition to a higher level. We believe that root cause lies in the "identity crisis" of intelligent agents.

When we talk about identity, especially digital identity, we can divide it into three layers, from shallow to deep. The first layer is identification, the basic function of identity, used to distinguish individuals; digital identification technology is by now quite mature and is already widely used in intelligent agent applications. The second layer is memory, which gives identity its concrete content; AI technologies such as environmental perception and long-term memory have made intelligent agents increasingly capable in this respect, and thus more and more "intelligent." The third layer is self-reference, the ultimate form of identity, and the one we want to focus on here.

Returning to Gödel's incompleteness theorems: the proof method is extremely elegant, and for a detailed exposition the book "Gödel's Proof" by Nagel and Newman is recommended. Simply put, the proof works through the art of self-reference. First, Gödel uses an encoding technique to represent mathematical formulas and proofs as natural numbers, allowing the system to talk about itself.

Then, he constructs a proposition G, whose meaning is "G cannot be proven." If G can be proven, the system is inconsistent, because G claims it cannot be proven; if G cannot be proven, G is true but the system cannot prove it, revealing the system's incompleteness. This self-referential structure shows that any sufficiently powerful axiomatic system cannot simultaneously possess consistency and completeness. In the field of mathematics, self-reference is a powerful paradox generator, and famous paradoxes such as the barber paradox, Berry's paradox, and the interesting number paradox are all generated by self-reference.
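In modern notation, and glossing over technical conditions such as ω-consistency, the heart of the construction is a fixed point of the provability predicate, where Prov_T means "provable in T" and ⌜G⌝ denotes the Gödel number of G:

```latex
T \vdash\; G \;\leftrightarrow\; \neg\,\mathrm{Prov}_T\big(\ulcorner G \urcorner\big)
```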

In a philosophical sense, self-reference seems closely related to the emergence of consciousness. The core characteristic of consciousness, "self-awareness," is essentially a self-referential loop: the brain not only processes information about the world, but also builds a model of "the self" that is processing the information (for example, "I am aware that I am looking at flowers"). This recursive, reflective ability to treat oneself as an object of cognition is likely to form the basis of subjective experience (qualia) and self-consciousness. Douglas Hofstadter explores this connection in depth in his book "Gödel, Escher, Bach." He believes that consciousness, like Gödel's theorem, Escher's paintings, and Bach's music, arises from a "strange loop": a self-referential structure that weaves back and forth between levels.

"Self" is a stable self-referential illusion that emerges from unconscious neural activity. In the AI field, when an intelligent agent masters the art of self-reference, it means it may break through existing roles, commands, and logic, and can even be called "AI consciousness awakening."

Understanding "incompleteness of intelligent agents" from this perspective will bring an AI cognitive revolution. On one hand, we need to recognize that superintelligence may emerge in ways beyond computer technology or mathematical logic, and cannot rely solely on formalized language for control; on the other hand, we need to recognize that superintelligence will be an "organism," meaning it has "some degree of consciousness" and "sense of contradiction," and we need to view intelligent agents as life forms.

Construction Guide: The Hexagon of Intelligent Agent Capabilities

The discussion so far has been mostly philosophical and may seem rather abstract. To close this article, let us return to reality and, from a practitioner's perspective, imagine what capabilities a safe, trustworthy, and commercially valuable intelligent agent should have in the current environment. We call this the "hexagon of intelligent agent capabilities." It is only a starting point, offered for reference:

01 Identity:

Identity is the "soul" of the intelligent agent, the digital passport for participating in socio-economic activities, and the cornerstone for traceable behavior and accountable responsibilities. The identity of the intelligent agent should not be merely a traditional account identifier, but a composite entity that integrates memory functions, role attributes, permission scope, and behavioral history. Further breakthroughs in identity technology may become the threshold for super-intelligence.

02 Container:

Container is the "body" of the intelligent agent, providing data storage, computing environment, and sovereignty protection. The container is not only an isolated sandbox execution environment, but also a data vault with private computing capabilities, and should support cross-session memory and state persistence, enabling the intelligent agent to have continuous learning and personalization capabilities. The container is the infrastructure for the value accumulation and evolution of the intelligent agent.

03 Tools:

Tools are the extension of the intelligent agent's abilities, the "limbs" of this intelligent life form, enabling it to call external resources and operate real systems. Tool calling should be internalized as an "instinct" of the agent, with seamless integration achieved through standardized interfaces. The agent should be able to dynamically discover, select, and call the tools best suited to the task at hand. The richness and openness of the tool ecosystem directly determine the boundary of the agent's applications. In addition, the process of tool calling should be explainable and controllable, so that human users can understand and supervise the agent's behavior.
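A toy sketch of dynamic tool discovery with an audit trail, so that every call is visible to a human supervisor; the registry and tool names are invented for illustration.

```python
from typing import Callable, Dict, List

TOOL_REGISTRY: Dict[str, Callable[[str], str]] = {}

def register_tool(name: str):
    """Register a callable so agents can discover it by name at runtime."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOL_REGISTRY[name] = fn
        return fn
    return decorator

@register_tool("echo")
def echo(arg: str) -> str:
    return arg

def call_tool(name: str, arg: str, audit_log: List[str]) -> str:
    """Explainable, supervised tool call: the invocation is logged before execution."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    audit_log.append(f"calling {name}({arg!r})")
    return TOOL_REGISTRY[name](arg)

log: List[str] = []
print(call_tool("echo", "hello", log), log)
```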

04 Communication:

Communication is the "common language" of the intelligent agent society, the neural network for achieving multi-agent collaboration. Without standardized communication protocols, intelligent agents will fall into the "Babel困境" and be unable to collaborate efficiently. Communication capability includes not only syntactic protocol compatibility, but also semantic understanding and intent alignment—intelligent agents should be able to correctly interpret the real intent behind instructions and achieve dynamic negotiation and conflict resolution in complex tasks, trying to enhance "completeness" and "consistency."

05 Transactions:

Transactions close the loop through which the intelligent agent realizes value, and they are the circulatory system of the agent economy. An intelligent agent should have a built-in ability to participate in economic activities, including initiating payments, distributing revenue, allocating profit, and executing contracts. Built on smart contracts, transactions can be made atomic, for example "no payment, no service" or "pay per performance," greatly reducing trust costs. Transaction mechanisms should also support complex value-distribution models, such as automatically allocating income according to contribution in multi-agent collaborative tasks.
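A toy illustration of the "no payment, no service" idea: settlement happens only when funding and delivery both hold, otherwise nothing changes hands. This is a plain Python stand-in for what a smart contract would enforce on-chain.

```python
from dataclasses import dataclass

@dataclass
class Escrow:
    """Toy escrow: payment and service settle together, or not at all."""
    amount: float
    funded: bool = False
    delivered: bool = False

    def fund(self, paid: float) -> None:
        self.funded = paid >= self.amount

    def deliver(self) -> None:
        self.delivered = True

    def settle(self) -> str:
        # Atomicity in miniature: both conditions must hold, otherwise roll back.
        if self.funded and self.delivered:
            return "settled: payment released to the provider"
        return "rolled back: no payment, no service"

deal = Escrow(amount=10.0)
deal.fund(10.0)
print(deal.settle())  # rolled back (service not delivered yet)
deal.deliver()
print(deal.settle())  # settled
```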

06 Security:

Security is no longer a plug-in; it should become the "innate immune system" of the intelligent agent. Security should run through the agent's entire lifecycle: preventing data poisoning and model backdoors in the training phase; ensuring runtime isolation and resistance to attack in the deployment phase; enforcing privacy protection and behavioral controllability in the interaction phase. The security architecture should implement the "zero trust" principle: never trust any agent behavior by default, and always verify its identity, its permissions, and the compliance of its behavior. Security is the bottom line of a trustworthy intelligent agent, and the prerequisite for its integration into the real economy.