How to Optimize your Python Program for Slowness

Also available: A Rust version of this article.

Everyone talks about making Python programs faster [1, 2, 3], but what if we pursue the opposite goal? Let's explore how to make them slower, absurdly slower. Along the way, we'll examine the nature of computation, the role of memory, and the scale of unimaginably large numbers.

Our guiding challenge: write short Python programs that run for an extraordinarily long time.

To do this, we'll explore a sequence of rule sets, each defining what kind of programs we're allowed to write by placing constraints on halting, memory, and program state. This sequence isn't a progression but a series of shifts in perspective. Each rule set helps reveal something different about how simple code can stretch time.

Here are the rule sets we'll investigate:

Anything Goes – Infinite Loop
Must Halt, Finite Memory – Nested, Fixed-Range Loops
Infinite, Zero-Initialized Memory – 5-State Turing Machine
Infinite, Zero-Initialized Memory – 6-State Turing Machine (>10↑↑15 steps)
Infinite, Zero-Initialized Memory – Plain Python (compute 10↑↑15 without Turing machine emulation)

Aside: 10↑↑15 is not a typo or a double exponent. It's a number so large that "exponential" and "astronomical" don't describe it. We'll define it in Rule Set 4.

We start with the most permissive rule set. From there, we'll change the rules step by step to see how different constraints shape what long-running programs look like, and what they can teach us.

Rule Set 1: Anything Goes – Infinite Loop

We start with the most permissive rules: the program doesn't need to halt, can use unlimited memory, and can contain arbitrary code.

If our only goal is to run forever, the solution is simple:

while True:
    pass

This program is short, uses negligible memory, and never finishes. It satisfies the challenge in the most literal way: by doing nothing forever.

Of course, it's not interesting: it does nothing. But it gives us a baseline: if we remove all constraints, infinite runtime is trivial. In the next rule set, we'll introduce our first constraint: the program must eventually halt. Let's see how far we can stretch the running time under that new requirement, using only finite memory.

Rule Set 2: Must Halt, Finite Memory – Nested, Fixed-Range Loops

If we want a program that runs longer than the universe will survive and then halts, that's easy. Just write two nested loops, each counting over a fixed range from 0 to 10¹⁰⁰−1:

for a in range(10**100):
    for b in range(10**100):
        if b % 10_000_000 == 0:
            print(f"{a:,}, {b:,}")

You can see that this program halts after 10¹⁰⁰ × 10¹⁰⁰ steps. That's 10²⁰⁰. And, ignoring the print, this program uses only a small amount of memory to hold its two integer loop variables: just 144 bytes.

My desktop computer runs this program at about 14 million steps per second. But suppose it could run at Planck speed, one step per Planck time (the smallest meaningful unit of time in physics). That would be about 10⁵⁰ steps per year, so about 10¹⁵⁰ years to finish.

Current cosmological models estimate the heat death of the universe in 10¹⁰⁰ years, so our program will run about 10¹⁵⁰ ÷ 10¹⁰⁰ = 10⁵⁰ times longer than the projected lifetime of the universe.
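As a rough check of that arithmetic, here is a small sketch that works in base-10 logarithms so the huge numbers never have to be materialized. It assumes a Planck time of about 5.39 × 10⁻⁴⁴ seconds and a 365.25-day year; those two constants are our additions, not figures from the original text.

```python
import math

PLANCK_TIME_S = 5.391247e-44           # assumed: Planck time in seconds
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed: Julian year

# One loop step per Planck time.
steps_per_year = SECONDS_PER_YEAR / PLANCK_TIME_S
log10_steps_per_year = math.log10(steps_per_year)

# Two nested loops over range(10**100) take 10**100 * 10**100 = 10**200 steps.
log10_total_steps = 200
log10_years = log10_total_steps - log10_steps_per_year

# Heat death of the universe, per the estimate quoted above: about 10**100 years.
log10_universe_years = 100
log10_ratio = log10_years - log10_universe_years

print(f"steps per year        ~ 10^{log10_steps_per_year:.1f}")
print(f"years to finish       ~ 10^{log10_years:.1f}")
print(f"universe lifetimes    ~ 10^{log10_ratio:.1f}")
```

The unrounded figures come out a little under 10¹⁵⁰ years and roughly 10⁴⁹ universe lifetimes, consistent with the order-of-magnitude estimates above.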

Aside: Practical concerns about running a program beyond the end of the universe are outside the scope of this article.

For an added margin, we can use more memory. Instead of 144 bytes for variables, let's use 64 gigabytes, about what you'd find in a well-equipped personal computer. That's about 500 million times more memory, which gives us about one billion variables instead of two. If each variable iterates over the full 10¹⁰⁰ range, the total number of steps becomes roughly (10¹⁰⁰)^(10⁹), or about 10^(100 billion) steps. At Planck speed, roughly 10⁵⁰ steps per year, that corresponds to 10^(100 billion − 50) years of computation.
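Writing a billion nested for statements by hand isn't practical, but the same effect can be sketched with a fixed-size list of loop counters advanced like an odometer. This is our illustration, not code from the original; the helper name `odometer_steps` is hypothetical. Memory stays fixed at k counters, and the loop always halts after exactly base**k steps. With base = 10**100 and k near a billion it would never finish in practice, so the example uses tiny values.

```python
def odometer_steps(base: int, k: int) -> int:
    """Run k nested loops, each over range(base), as one flat loop over k counters.

    Uses a fixed amount of memory and always halts, after exactly base**k steps.
    """
    counters = [0] * k
    steps = 0
    while True:
        steps += 1  # one execution of the innermost loop body
        # Advance the odometer: bump the last counter, carrying leftward on wraparound.
        position = k - 1
        while position >= 0:
            counters[position] += 1
            if counters[position] < base:
                break
            counters[position] = 0
            position -= 1
        if position < 0:  # every counter wrapped around, so all loops are done
            return steps


print(odometer_steps(10, 2))  # same work as two nested loops over range(10)
print(odometer_steps(3, 4))   # four nested loops over range(3)
```

Each extra counter multiplies the runtime by base while adding only one more slot of memory, which is why spending 64 GB on counters pushes the step count to 10^(100 billion).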

Can we do better? Well, if we allow an unrealistic but interesting rule change, we can do much, much better.

Rule Set 3: Infinite, Zero-Initialized Memory – 5-State Turing Machine

What if we allow infinite memory, as long as it starts out entirely zero?

Aside: Why don't we allow infinite, arbitrarily initialized memory? Because it trivializes the challenge. For example, you could mark a single byte far out in memory with a 0x01, say at position 10¹²⁰, and write a tiny program that just scans until it finds it. That program would take an absurdly long time to run, but it wouldn't be interesting. The slowness is baked into the data, not the code. We're after something deeper: small programs that generate their own long runtimes from simple, uniform starting conditions.

My first thought was to use the memory to count upward in binary.
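A minimal sketch of that idea, assuming a Python list stands in for the all-zero memory (least significant bit first). The `max_increments` cap is our addition so the demo terminates; the counting itself has no natural stopping point.

```python
def count_in_binary(memory: list[int], max_increments: int) -> None:
    """Increment a little-endian binary number stored in `memory`, in place."""
    for _ in range(max_increments):  # the cap exists only so this demo halts
        i = 0
        while memory[i] == 1:  # carry: trailing 1s become 0s
            memory[i] = 0
            i += 1
        memory[i] = 1  # the first 0 becomes 1


memory = [0] * 8  # finite stand-in for infinite, zero-initialized memory
count_in_binary(memory, 5)
print(memory)  # 5 in binary, least significant bit first
```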

We can do that, but how do we know when to stop? If we don't stop, we're violating the "must halt" rule. So, what else can we try?

Let's take inspiration from the father of computer science, Alan Turing. We'll program a simple abstract machine, now known as a Turing machine, under the following constraints:

The machine has infinite memory, laid out as a tape that extends endlessly in both directions. Each cell on the tape holds a single bit: 0 or 1.
A read/write head moves across the tape. On each step, it reads the current bit, writes a new bit (0 or 1), and moves one cell left or right.

A read/write head positioned on an infinite tape.

The machine also has an internal variable called state, which can hold one of n values. For example, with 5 states, we might name the possible values A, B, C, D, and E, plus a special halting state H, which we don't count among the five. The machine always starts in the first state, A.

We can express a full Turing machine program as a transition table. Here's an example we'll walk through step by step.

A 5-state Turing machine transition table:

    A   B   C   D   E
0   1RB 1RC 1RD 1LA 1RH
1   1LC 1RB 0LE 1LD 0LA

Each row corresponds to the current tape value (0 or 1).

Each column corresponds to the current state (A through E).

Each entry in the table tells the machine what to do next:

The first character is the bit to write (0 or 1).

The second is the direction to move (L for left, R for right).

The third is the next state to enter (A, B, C, D, E, or H, where H is the special halting state).
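The same table can also be written as a single string: one underscore-separated group per state, each group holding the 0-entry followed by the 1-entry. That compact notation is common in the Busy Beaver community. The sketch below assumes that layout, uses H as the halting state as in this article, and converts the string into a dictionary keyed by (state, bit read). `parse_machine` is a hypothetical helper, and the string is our transcription of the 5-state champion's published table.

```python
def parse_machine(text: str) -> dict[tuple[str, int], tuple[int, str, str]]:
    """Parse e.g. '1RB1LC_1RC1RB' into {(state, bit_read): (bit_to_write, move, next_state)}.

    States are named A, B, C, ... in order. Each state's group holds a
    3-character entry for reading 0, followed by one for reading 1.
    """
    table = {}
    for index, state_text in enumerate(text.split("_")):
        state = chr(ord("A") + index)
        for bit_read in (0, 1):
            entry = state_text[3 * bit_read : 3 * bit_read + 3]
            table[(state, bit_read)] = (int(entry[0]), entry[1], entry[2])
    return table


champion = parse_machine("1RB1LC_1RC1RB_1RD0LE_1LA1LD_1RH0LA")
print(champion[("A", 0)])  # row 0, column A
```

A dictionary lookup on (state, bit) is exactly the "find the row and column" operation described above.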

Now that we've defined the machine, let's see how it behaves over time.


We'll refer to each moment in time (the full configuration of the machine and tape) as a step. This includes the current tape contents, the head position, and the machine's internal state (like A, B, or H).

At Step 0, the head is pointing to a 0 on the tape, and the machine is in state A.

Looking at row 0, column A in the program table, we find the instruction 1RB. That means:

Write 1 to the current tape cell.

Move the head Right.

Enter state B.

This puts us in Step 1.

The machine is now in state B, pointing at the next tape cell (again 0).

What will happen if we let this Turing machine keep running? It will run for exactly 47,176,870 steps, and then halt.
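Here's a minimal interpreter sketch (ours, not the notebook's code): a `defaultdict` stands in for the infinite, zero-initialized tape, since unvisited cells read as 0, and the loop applies one table entry per step until the machine enters H. It's demonstrated on the 2-state champion, 1RB1LB_1LA1RH, which halts after 6 steps with four 1s on the tape. The same loop should work on the 5-state table, but pure Python will take a while to get through 47 million steps.

```python
from collections import defaultdict

# (state, bit read) -> (bit to write, move, next state); H is the halting state.
TWO_STATE = {
    ("A", 0): (1, "R", "B"), ("A", 1): (1, "L", "B"),
    ("B", 0): (1, "L", "A"), ("B", 1): (1, "R", "H"),
}


def run(table, max_steps=None):
    """Run a Turing machine from an all-zero tape; return (steps, number of 1s)."""
    tape = defaultdict(int)  # unvisited cells read as 0: infinite, zero-initialized memory
    head, state, steps = 0, "A", 0
    while state != "H":
        if max_steps is not None and steps >= max_steps:
            raise RuntimeError("step limit reached before halting")
        write_bit, move, state = table[(state, tape[head])]
        tape[head] = write_bit
        head += 1 if move == "R" else -1
        steps += 1
    return steps, sum(tape.values())


print(run(TWO_STATE))
```

The optional `max_steps` guard is our addition; it's handy when experimenting with tables that might not halt.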

Aside: With a Google sign-in, you can run this yourself via a Python notebook on Google Colab. Alternatively, you can copy and run the notebook locally on your own computer by downloading it from GitHub.

That number, 47,176,870, is astonishing on its own, but seeing the full run makes it more tangible. We can visualize the execution using a space-time diagram, where each row shows the tape at a single step, from top (earliest) to bottom (latest). In the image:

The first row is blank: it shows the all-zero tape before the machine takes its first step.

1s are shown in orange.

0s are shown in white.

Light orange appears where 0s and 1s are so close together they blend.

Space-time diagram for the champion 5-state Turing machine. It runs for 47,176,870 steps before halting. Each row shows the tape at a single step, starting from the top. Orange represents 1, white represents 0.
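A text-only version of a space-time diagram can be sketched in a few lines. This hypothetical `space_time` helper records the tape after every step, then renders one row per step over the cells the head visited, with '#' for 1 and '.' for 0. The 2-state champion keeps the output small; it's our choice of example, not the article's visualizer.

```python
from collections import defaultdict

# 2-state champion: (state, bit read) -> (bit to write, move, next state).
TWO_STATE = {
    ("A", 0): (1, "R", "B"), ("A", 1): (1, "L", "B"),
    ("B", 0): (1, "L", "A"), ("B", 1): (1, "R", "H"),
}


def space_time(table):
    """Return one text row per step, starting with the blank Step 0 tape."""
    tape = defaultdict(int)
    head, state = 0, "A"
    snapshots = [dict(tape)]  # Step 0: the all-zero tape
    visited = {0}
    while state != "H":
        write_bit, move, state = table[(state, tape[head])]
        tape[head] = write_bit
        head += 1 if move == "R" else -1
        visited.add(head)
        snapshots.append(dict(tape))
    left, right = min(visited), max(visited)
    return [
        "".join("#" if snap.get(cell, 0) else "." for cell in range(left, right + 1))
        for snap in snapshots
    ]


for row in space_time(TWO_STATE):
    print(row)
```

Storing a full snapshot per step is fine for tiny machines; for millions of steps you would render rows on the fly instead.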

In 2023, an online community of amateur researchers organized by bbchallenge.org proved that this is the longest-running 5-state Turing machine that eventually halts.

Want to see this Turing machine in motion? You can watch the full 47-million-step execution unfold in this pixel-perfect video.

Or interact with it directly using the Busy Beaver Blaze web app.

The video generator and web app are part of busy-beaver-blaze, the open-source Python & Rust project that accompanies this article.

It's hard to believe that such a small machine can run 47 million steps and still halt. But it gets even more astonishing: the team at bbchallenge.org found a 6-state machine with a runtime so long it can't even be written with ordinary exponents.

Rule Set 4: Infinite, Zero-Initialized Memory – 6-State Turing Machine (>10↑↑15 steps)

As of this writing, the longest-running (but still halting) 6-state Turing machine known to humankind is:

    A   B   C   D   E   F
0   1RB 1RC 1LC 0LE 1LF 0RC
1   0LD 0RF 1LA 1RH 0RB 0RE

Here's a video showing its first 10 trillion steps.

And here you can run it interactively via a web app.

So, if we're patient (comically patient), how long will this Turing machine run? More than 10↑↑15, where 10↑↑15 means a tower of fifteen 10s, evaluated from the top down.

This is not the same as 10¹⁵ (which is just a regular exponent). Instead:

10↑↑1 = 10
10↑↑2 = 10¹⁰ = 10,000,000,000
10↑↑3 = 10^(10¹⁰) = 10^10,000,000,000, already unimaginably large.
10↑↑4 = 10^(10^(10¹⁰)) is so large that it vastly exceeds the number of atoms in the observable universe.
10↑↑15 is so large that writing it in exponent notation becomes annoying.
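The definition can be made concrete with a few lines of Python. This sketch is our own and uses the built-in ** operator rather than the in-place xmpz approach developed later; it builds the tower from the top down, so 10↑↑3 means 10**(10**10), not (10**10)**10. Only tiny inputs are practical.

```python
def tower(a: int, height: int) -> int:
    """Compute a↑↑height: a tower of `height` copies of a, evaluated from the top down."""
    result = 1  # a↑↑0 = 1 by convention
    for _ in range(height):
        result = a ** result  # the previous tower becomes the exponent of a new base a
    return result


print(tower(2, 3))  # 2**(2**2)
print(tower(2, 4))  # 2**(2**(2**2))
print(tower(10, 2) == 10**10)
```

Don't try tower(10, 3): the result would have 10,000,000,001 digits.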

Pavel Kropitz announced this 6-state machine on May 30, 2022. Shawn Ligocki has a great write-up explaining both his and Pavel's discoveries. To prove that these machines run so long and then halt, researchers used a mix of analysis and automated tools. Rather than simulating every step, they identified repeating structures and patterns that could be proven, using formal, machine-verified proofs, to eventually lead to halting.

Up to this point, we've been talking about Turing machines, specifically the longest-known 5- and 6-state machines that eventually halt. We ran the 5-state champion to completion and watched visualizations to explore its behavior. But the discovery that it's the longest halting machine with 5 states, and the identification of the 6-state contender, came from extensive research and formal proofs, not from running them step by step.

That said, the Turing machine interpreter I built in Python can run for millions of steps, and the visualizer written in Rust can handle trillions (see GitHub). But even 10 trillion steps isn't an atom in a drop of water in the ocean compared to the full runtime of the 6-state machine. And running it that far doesn't get us any closer to understanding why it runs so long.

Aside: Python and Rust "interpreted" the Turing machines up to some point, reading their transition tables and applying the rules step by step. You could also say they "emulated" them, in that they reproduced their behavior exactly. I avoid the word "simulated": a simulated elephant isn't an elephant, but a simulated computer is a computer.

Returning to our central challenge: we want to understand what makes a short program run for a long time. Instead of analyzing these Turing machines, let's construct a Python program whose 10↑↑15 runtime is clear by design.

Rule Set 5: Infinite, Zero-Initialized Memory – Plain Python (compute 10↑↑15 without Turing machine emulation)

Our challenge is to write a small Python program that runs for at least 10↑↑15 steps, using any amount of zero-initialized memory.

To achieve this, we'll compute the value of 10↑↑15 in a way that guarantees the program takes at least that many steps. The ↑↑ operator is called tetration. Recall from Rule Set 4 that ↑↑ stacks exponents: for example, 10↑↑3 means 10^(10^10). It's an extremely fast-growing function. We'll program it from the ground up.

Rather than rely on built-in operators, we'll define tetration from first principles:

Tetration, implemented by the function tetrate, as repeated exponentiation
Exponentiation, via exponentiate, as repeated multiplication
Multiplication, via multiply, as repeated addition
Addition, via add, as repeated increment

Each layer builds on the one below it, using only zero-initialized memory and in-place updates.

We'll begin at the foundation, with the simplest operation of all: increment.

Increment

Here's our definition of increment and an example of its use:

from gmpy2 import xmpz

def increment(increment_acc):
    assert is_valid_accumulator(increment_acc), "not a valid accumulator"
    increment_acc += 1  # xmpz supports in-place +=, so the caller's object changes

def is_valid_accumulator(acc):
    return isinstance(acc, xmpz) and acc >= 0

b = xmpz(4)
print(f"++{b} = ", end="")
increment(b)
print(b)
assert b == 5

Output:

++4 = 5

We're using xmpz, a mutable arbitrary-precision integer type provided by the gmpy2 library. It behaves like Python's built-in int in terms of numeric range (limited only by memory), but unlike int, it supports in-place updates.

To stay true to the spirit of a Turing machine and to keep the logic minimal and observable, we restrict ourselves to just a few operations:

Creating an integer with value 0 (xmpz(0))
In-place increment (+= 1) and decrement (-= 1)
Comparing with zero

All arithmetic is done in place, with no copies and no temporary values. Each function in our computation chain modifies an accumulator directly. Most functions also take an input value a, but increment, being the most basic, doesn't. We use descriptive names like increment_acc, add_acc, and so on to make the operation clear and to support later functions where multiple accumulators will appear together.

Aside: Why not use Python's built-in int type? It supports arbitrary precision and can grow as large as your memory allows. But it's also immutable, meaning any update like += 1 creates a new integer object. Even if you think you're modifying a large number in place, Python is actually copying all of its internal memory, no matter how big it is.
For instance:

x = 10**100
y = x
x += 1
assert x == 10**100 + 1 and y == 10**100

Even though x and y start out identical, x += 1 creates a new object, leaving y unchanged. This behavior is fine for small numbers, but it violates our rules about memory use and in-place updates. That's why we use gmpy2.xmpz, a mutable arbitrary-precision integer that truly supports efficient, in-place changes.

Addition

With increment defined, we next define addition as repeated incrementing.

def add(a, add_acc):
    assert is_valid_other(a), "not a valid other"
    assert is_valid_accumulator(add_acc), "not a valid accumulator"
    for _ in range(a):
        add_acc += 1

def is_valid_other(a):
    return isinstance(a, int) and a >= 0

a = 2
b = xmpz(4)
print(f"Before: id(b) = {id(b)}")
print(f"{a} + {b} = ", end="")
add(a, b)
print(b)
print(f"After:  id(b) = {id(b)}")  # ← compare object IDs
assert b == 6

Output:

Before: id(b) = 2082778466064
2 + 4 = 6
After:  id(b) = 2082778466064

The function adds a to add_acc by incrementing add_acc one step at a time, a times. Because b now holds 6 while its id is unchanged, we know the function modified the caller's object in place rather than creating a new one.

Aside: You might wonder why add doesn't just call our increment function. We could write it that way, but we're deliberately inlining each level by hand. This keeps all loops visible, makes control flow explicit, and helps us reason precisely about how much work each function performs.

Even though gmpy2.xmpz supports direct addition, we don't use it. We're working at the most primitive level possible (incrementing by 1) to keep the logic simple and intentionally slow, and to make the amount of work explicit.

As with increment_acc, we update add_acc in place, with no copying or temporary values. The only operation we use is += 1, repeated a times.

Next, we define multiplication.

Multiplication

With addition in place, we can now define multiplication as repeated addition. Here's the function and example usage. Unlike add and increment, this one builds up a new xmpz value from zero and returns it.

def multiply(a, multiply_acc):
    assert is_valid_other(a), "not a valid other"
    assert is_valid_accumulator(multiply_acc), "not a valid accumulator"

    add_acc = xmpz(0)
    for _ in count_down(multiply_acc):
        for _ in range(a):
            add_acc += 1
    return add_acc

def count_down(acc):
    assert is_valid_accumulator(acc), "not a valid accumulator"
    while acc > 0:
        acc -= 1  # decrement the caller's xmpz in place
        yield

a = 2
b = xmpz(4)
print(f"{a} * {b} = ", end="")
c = multiply(a, b)
print(c)
assert c == 8
assert b == 0

Output:

2 * 4 = 8

This multiplies a by the value of multiply_acc by adding a to add_acc once for every time multiply_acc can be decremented. The result is returned and then assigned to c. The original multiply_acc is decremented to zero and consumed in the process.

You might wonder what this line does:

for _ in count_down(multiply_acc):

While xmpz technically works with range(), doing so converts it to a standard Python int, which is immutable. That triggers a full copy of its internal memory, an expensive operation for large values. Worse, every value range() produces is a fresh integer object, so each step of what should be a cheap loop pays to allocate and fill a number as large as the counter. Our custom count_down() avoids all that by decrementing in place, yielding control without copying, and maintaining predictable memory use.

We've built multiplication from repeated addition. Now it's time to go a layer further: exponentiation.

Exponentiation

We define exponentiation as repeated multiplication. As before, we perform all work using only incrementing, decrementing, and in-place memory. As with multiply, the final result is returned while the input accumulator is consumed.

Here's the function and example usage:

def exponentiate(a, exponentiate_acc):
    assert is_valid_other(a), "not a valid other"
    assert is_valid_accumulator(exponentiate_acc), "not a valid accumulator"
    assert a > 0 or exponentiate_acc != 0, "0^0 is undefined"

    multiply_acc = xmpz(0)
    multiply_acc += 1
    for _ in count_down(exponentiate_acc):
        add_acc = xmpz(0)
        for _ in count_down(multiply_acc):
            for _ in range(a):
                add_acc += 1
        multiply_acc = add_acc
    return multiply_acc


a = 2
b = xmpz(4)
print(f"{a}^{b} = ", end="")
c = exponentiate(a, b)
print(c)
assert c == 16
assert b == 0

Output:

2^4 = 16

This raises a to the power of exponentiate_acc, using only incrementing, decrementing, and loop control. We initialize multiply_acc to 1 with a single increment, because repeatedly multiplying from zero would get us nowhere. Then, for each time exponentiate_acc can be decremented, we multiply the current result (multiply_acc) by a. As with the earlier layers, we inline the multiply logic directly instead of calling the multiply function, so the control flow and step count stay fully visible.

Aside: And how many times is += 1 called? Obviously at least 2⁴ times, because our result is 2⁴, and we reach it by incrementing from zero. More precisely, the number of increments is:

• 1 increment: initializing multiply_acc to 1

Then we loop four times, and in each loop, we multiply the current value of multiply_acc by a = 2, using repeated addition:

• 2 increments: for multiply_acc = 1, add 2 once
• 4 increments: for multiply_acc = 2, add 2 twice
• 8 increments: for multiply_acc = 4, add 2 four times
• 16 increments: for multiply_acc = 8, add 2 eight times

That's a total of 1 + 2 + 4 + 8 + 16 = 31 increments, which is 2⁵−1. In general, computing a^b this way takes 1 + a + a² + ⋯ + a^b increments: an exponential count, but not the same exponential that we're computing.
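To check that count without gmpy2, here's a sketch that mirrors the loop structure of exponentiate and tetrate using plain Python ints and an explicit counter. It tracks only how many += 1 operations happen, not the in-place memory behavior, so treat it as a bookkeeping aid rather than a replacement for the functions above; the `count_increments_*` names are ours.

```python
def count_increments_exponentiate(a: int, b: int) -> tuple[int, int]:
    """Return (a**b, number of += 1 operations), following exponentiate's loops."""
    increments = 1  # initializing multiply_acc to 1
    multiply_acc = 1
    for _ in range(b):
        add_acc = 0
        for _ in range(multiply_acc):  # multiply_acc * a increments in this pass
            for _ in range(a):
                add_acc += 1
                increments += 1
        multiply_acc = add_acc
    return multiply_acc, increments


def count_increments_tetrate(a: int, b: int) -> tuple[int, int]:
    """Return (a↑↑b, number of += 1 operations), following tetrate's loops."""
    increments = 1  # initializing exponentiate_acc to 1
    exponentiate_acc = 1
    for _ in range(b):
        # Each outer pass re-initializes multiply_acc and runs exponentiate's loops.
        exponentiate_acc, used = count_increments_exponentiate(a, exponentiate_acc)
        increments += used
    return exponentiate_acc, increments


print(count_increments_exponentiate(2, 4))  # (value, increments)
print(count_increments_tetrate(2, 3))
```

For 2↑↑3 the passes compute 2¹, 2², and 2⁴, costing 3, 7, and 31 increments, plus the initial 1, for 42 in total.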

With exponentiation defined, we're ready for the top of our tower: tetration.

Tetration

Here's the function and example usage:

def tetrate(a, tetrate_acc):
    assert is_valid_other(a), "not a valid other"
    assert is_valid_accumulator(tetrate_acc), "not a valid accumulator"
    assert a > 0, "we don't define 0↑↑b"

    exponentiate_acc = xmpz(0)
    exponentiate_acc += 1
    for _ in count_down(tetrate_acc):
        multiply_acc = xmpz(0)
        multiply_acc += 1
        for _ in count_down(exponentiate_acc):
            add_acc = xmpz(0)
            for _ in count_down(multiply_acc):
                for _ in range(a):
                    add_acc += 1
            multiply_acc = add_acc
        exponentiate_acc = multiply_acc
    return exponentiate_acc


a = 2
b = xmpz(3)
print(f"{a}↑↑{b} = ", end="")
c = tetrate(a, b)
print(c)
assert c == 16  # 2^(2^2)
assert b == 0   # Confirm tetrate_acc is consumed

Output:

2↑↑3 = 16

This computes a ↑↑ tetrate_acc: a tower of tetrate_acc copies of a, built by repeatedly raising a to the power of the previous result.

For each decrement of tetrate_acc, we raise a to the power of the current value. We inline the entire exponentiate and multiply logic again, all the way down to repeated increments.

As expected, this computes 2^(2^2) = 16. With a Google sign-in, you can run this yourself via a Python notebook on Google Colab. Alternatively, you can copy the notebook from GitHub and then run it on your own computer.

We can also run tetrate on 10↑↑15. It will start running, but it won't stop during our lifetimes, or even the lifetime of the universe:

a = 10
b = xmpz(15)
print(f"{a}↑↑{b} = ", end="")
c = tetrate(a, b)
print(c)

Let's compare this tetrate function to what we found in the previous rule sets.

Rule Set 1: Anything Goes – Infinite Loop

Recall our first function:

while True:
    pass

Unlike this infinite loop, our tetrate function eventually halts, though not anytime soon.

Rule Set 2: Must Halt, Finite Memory – Nested, Fixed-Range Loops

Recall our second function:

for a in range(10**100):
    for b in range(10**100):
        if b % 10_000_000 == 0:
            print(f"{a:,}, {b:,}")

Both this function and our tetrate function contain a fixed number of nested loops. But tetrate differs in an important way: the number of loop iterations grows with the input value. In this function, each loop runs from 0 to 10¹⁰⁰−1, a hardcoded bound. In tetrate, by contrast, the loop bounds are dynamic: they grow explosively with each layer of computation.

Rule Sets 3 & 4: Infinite, Zero-Initialized Memory – 5- and 6-State Turing Machines

Compared to the Turing machines, our tetrate function has a clear advantage: we can directly see that it will call += 1 more than 10↑↑15 times. Even better, we can also see, by construction, that it halts.

What the Turing machines offer instead is a simpler, more universal model of computation, and perhaps a more principled definition of what counts as a "small program."

Conclusion

So, there you have it: a journey through writing absurdly slow programs. Along the way, we explored the outer edges of computation, memory, and performance, using everything from deeply nested loops to Turing machines to a hand-inlined tetration function.

Here's what surprised me:

Nested loops are enough.
If you just want a short program that halts after outliving the universe, two nested loops with 144 bytes of memory will do the job. I hadn't realized it was that simple.
Turing machines escalate fast.
The jump from 5 to 6 states unleashes a dramatic leap in complexity and runtime. Also, the importance of starting with zero-initialized memory is obvious in retrospect, but it wasn't something I'd considered before.
Python's int type can kill performance.
Yes, Python integers are arbitrary precision, which is great. But they're also immutable. That means every time you do something like x += 1, Python silently allocates a brand-new integer object, copying all the memory of x, no matter how big it is. It feels in-place, but it's not. This behavior turns efficient-looking code into a performance trap when working with large values. To get around this, we use the gmpy2.xmpz type, a mutable, arbitrary-precision integer that allows true in-place updates.
There's something beyond exponentiation, and it's called tetration.
I didn't know this. I wasn't familiar with the ↑↑ notation or the idea that exponentiation could itself be iterated to form something even faster-growing. It was surprising to learn how compactly it can express numbers that are otherwise unthinkably large.
And since I know you're asking: yes, there's something beyond tetration too. It's called pentation, then hexation, and so on. These are part of a whole hierarchy known as hyperoperations. There's even a metageneralization: systems like the Ackermann function and fast-growing hierarchies capture entire families of these functions and more.
Writing tetration with explicit loops was eye-opening.
I already knew that exponentiation is repeated multiplication, and so on. I also knew this could be written recursively. What I hadn't seen was how cleanly it could be written as nested loops, without copying values and with strict in-place updates.
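For the curious, the hyperoperation hierarchy mentioned above can be sketched recursively with Knuth's up-arrow notation. This is our illustration using plain Python ints, not the in-place style of this article; `up_arrow(a, n, b)` is a hypothetical helper for a ↑ⁿ b, where one arrow is exponentiation, two is tetration, and three is pentation. Anything beyond tiny arguments will not finish.

```python
def up_arrow(a: int, n: int, b: int) -> int:
    """Knuth's up-arrow a ↑ⁿ b: n=1 exponentiation, n=2 tetration, n=3 pentation."""
    if n == 1:
        return a ** b
    if b == 0:
        return 1
    # Each level iterates the level below: a ↑ⁿ b = a ↑ⁿ⁻¹ (a ↑ⁿ (b - 1)).
    return up_arrow(a, n - 1, up_arrow(a, n, b - 1))


print(up_arrow(2, 1, 4))  # 2^4
print(up_arrow(2, 2, 3))  # 2↑↑3 = 2^(2^2)
print(up_arrow(2, 3, 3))  # 2↑↑↑3 = 2↑↑4
```

The same pattern, one level iterating the level below, is what the nested loops of tetrate spell out by hand.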

Thanks for joining me on this journey. I hope you now have a clearer understanding of how small Python programs can run for an astonishingly long time, and what that reveals about computation, memory, and minimal systems. We've seen programs that halt only after the universe dies, and others that run even longer.

Please follow Carl on Towards Data Science and on @carlkadie.bsky.social. I write on scientific programming in Python and Rust, machine learning, and statistics. I tend to write about one article per month.