Bayesian Infohazard Risk Minimization: or, Harry Potter's Curse
I woke gently with the sun’s warmth on my face. Well, it simulated the tone and energy I’d get from an equivalent amount of sunlight coming in through a real blind. In reality it’s a lamp on a long boom arm clamped to my nightstand; it sits nearly at full extension to put the bulb directly over my pillow. I’ve programmed it to reach maximum brightness at 6:34AM, so that when 6:35 hits, I’m maximally conscious. I prefer not to be woken up by an alarm - minutes before it goes off, my cortisol literally spikes in my sleep in anticipation. Waking up with my hormones out of whack is a significant de-buff; the hours from 6:40AM to 10:00AM are some of my most productive, due to the unique combination of most people being unable to bother me by virtue of being asleep, and the fact that when I wake up optimally, with a little enhancement, I get a few hours of crazy mental clarity.
I can’t afford to lose those hours because I work on really important shit. Like, more important than anything you’ve probably ever done. Every hour of my work literally saves lives. When you think about it like that, it’ll make everything else make a lot more sense. Every decision is literally life or death; that’s why I think about things the way I do. It’d be pretty irrational if I didn’t afford everything in my life that level of care, when lives hang in the balance.
Anyway, I got out of bed fully clothed. I sleep in the same thing I wear every day - black merino-tech-blend travel pants, merino wool boxers, a grey merino wool travel T-shirt - because I realized that I only have a finite amount of mental energy to spend making high-quality decisions every day. Each of those decisions might, hypothetically, need to be spent later that day deciding whether to save three people from a burning building or two people drowning in a nearby pond; for the sake of the argument I can’t do both, and I have a reasonably high chance of saving whoever I try to save while staying basically unscathed myself. Naturally, in cases where I’d incur significant risk to my own body I would be obligated to do nothing, because if I get seriously hurt I won’t be able to save any future lives, so I’m technically risking expected future lives. Grey or white (or something even more exotic, like a green one I got at a career fair in college), not to mention all the different combinations you could make once you consider pants and socks? Wasting a decision on that might mean that later in the day, I might sit there, frozen, unable to respond to any of the five individual plaintive cries for help in my vicinity.
And that’s to say nothing of the even trickier, but perhaps even more likely, scenario where nobody’s drowning, but there’s a really expensive painting in the burning building too. Now, if you’re a non-consequentialist, you might wonder how that changes things. I’d just rush right past it to save the three people from before, right? Not so fast.
If I save the painting, I can sell it and donate the proceeds to charity. Most charities are a scam, obviously. I wouldn’t want my hard-earned cash going to pad some nonprofit worker’s six-figure salary and restaurant meals. But in the community we have a pretty exhaustive system that ranks them all according to effectiveness, which is obviously in terms of human lives saved, anywhere in the world, per dollar donated. The rankings change but right now the best one sends mosquito nets to Africa to prevent malaria. They’re crazy cheap and our statistics say that for each 600 to 1000 nets you distribute, you’d expect to save one life. At a cost of two to five dollars per net, that’s an upper bound of $5000 per expected life saved. So there’s some threshold value, around $15,001, at which it makes more sense to grab the painting, auction it off, and buy mosquito nets, thereby yielding a higher number of expected lives saved than if I’d just stupidly saved the burning people.
Honestly, it’s insane people donate to anything else. Most of my salary - actually pretty much all of it, once I take out a modest budget for living expenses, including the most important quality-of-life expenses I need in order to be maximally productive - goes directly to buying malaria nets. Besides my salary, my other main income stream at the moment comes from a crazy stat arb (statistical arbitrage) I found a while back. There’s a huge inefficiency in the baby formula market in the inner cities: companies are selling it for way less than people will pay, basically leaving profits on the table. So I have some pretty simple web scrapers to check for news on possible supply chain shocks, things like a recall on a contaminated batch of whey or lactose, and an algorithm that figures out which brands are affected, and then I buy up all the supply in all the big-box stores in a few cities. There’s a huge resale market on Facebook. Sure, those moms aren’t really rolling in cash. But they can all clearly afford it! And can you really say it’s worth killing someone over a few hundred people being upcharged? Because that’s what I’d be doing: every $5000 withheld from mosquito nets is someone who would otherwise be expected to be saved who’s now expected to die of malaria.
Originally, I’d considered going to work at a nonprofit instead. Money’s no object, at least not directly. I’m an Effective Altruist, so I’m mainly passionate about doing good. But it’s important to me that I’m doing the most possible good, because otherwise I’d be leaving some on the table, which is the same as doing harm. And, rationally speaking, of course, I realized that if I could earn enough money to pay more than one person’s nonprofit salary, it would make more sense to just do that and donate the difference, especially since, in all likelihood, I’d be working at a nonprofit with a higher cost per life saved than mosquito nets.
None of that is a knock on my wardrobe, though. Merino is really a great fabric. It’s lightweight, breathable, naturally moisture-wicking and odor-proof: the shirt was advertised as odor-resistant for three days but I’ve found I can stretch it to five or six if I meditate intermittently to keep my core temperature low and stop myself from sweating. Not needing to change every day is another huge unlock: obviously, time is money, and every dollar is, well, you get the idea. ## TODO: where should i put this sentence?
So I got out of bed and walked straight over to my standing desk to bang out some work. It was 6:40AM, and I checked my email and banged out a couple pull requests for the codebase at my day job. My 9-to-5, or in my case my 7-whenever, is a pretty standard software dev gig at a social media company, where I work on a team that detects and boosts racist posts. It’s not exactly Oxfam, but again, my salary can hire a few Oxfam employees a year. So it’d be pretty irrational to do much else.
At 7:25, neck deep in a nav-bar component, I stifled a yawn. That meant it was time for coffee. I don’t drink coffee first thing when I get out of bed, because I’m not maximally tired, so my adenosine receptors aren’t all open. I wait until I’m just about tired enough that it impacts my productivity, at which point I make myself a double espresso from my home setup. I got really into coffee over the pandemic. My diet is a lot of pretty standard fare, you know, Huel, lentils, stuff like that. So I was totally blown away by how delicious and complex espresso can be. And of course, if I got two double espressos a day at the coffee shop on my block, that’s $12 a day. The espresso machine was only $5200, so it paid for itself in the first year and a half I had it. That’s another literally lifesaving decision. Not to mention the time: going back and forth to the coffee shop takes between seven and fifteen minutes, so the half an hour I save every day is another huge unlock: obviously, time is money, and every dollar is, well, you get the idea.
I walked over to the kitchen, which is really just a table on the other side of my room. I live in the master bedroom of a five-bedroom house in Berkeley, which, at just over $2,000 a month is a really good deal. Since I don’t eat much that needs cooking I cut down on time spent in the communal kitchen, which is extremely difficult and taxing for everyone involved. At first, I tried to be friendly with the other housemates. But it’s really true what they say; without sharing a moral framework it’s almost impossible to get along. I tried explaining the consequentialist argument for not doing my dishes, with all the lives I’m expected to save with each hour of work, but they reacted pretty emotionally, something that’s really hard for me to handle. Once cooler heads finally prevailed, I realized the kitchen is a no-go, which is fine, because there’s plenty of space in my room for everything I need. Honestly, once I took a step back, I had to be empathetic and remind myself that most people don’t have the ability or the time to work through Eliezer’s Sequences on Rationality by themselves. I’m a natural autodidact and I sometimes forget that it isn’t easy for most people to reconstruct things from first principles.
The espresso machine, all eight cubic feet of it, silver and gleaming, sat heavily in the center of the table, my mug already directly under the filter basket. To the right, my shaker bottle and bulk bag of Huel Black Edition, with double the protein, which I’ve found keeps me full for about one and a half times as long per shake, making it ever so slightly edge out regular Huel for satiety per cent per meal. A few books - Algorithms to Live By, What We Owe the Future, stuff like that - sat in an errant stack to the left of the espresso machine collecting dust. I’ve already absorbed the max amount of knowledge from them, so I should really donate them, but that’d take a lot of time. And there’s no guarantee the person who grabbed them from the little free library on my block would be able to make use of the knowledge inside. So when I’m online on the forums, I usually keep an eye out for people who seem promising but underinformed. Of course, those books are basically required reading for anyone on the forums, so no luck so far.
I opened up my vacuum-sealed canister of coffee beans, which I order on a monthly subscription from a coffee plantation in Brazil. I used to get the fair trade stuff, but obviously that was way more expensive, and I realized that the marginal amount of forced child labor that might be happening on this particular plantation, at least as alleged by articles online, was definitely worth the time and money saved by ordering from them directly. I weighed out exactly 16.5 grams of coffee into my grinder, which I cranked by hand as I walked back over to my computer to spend the otherwise empty time looking over my code. Thirty seconds later I was done, so I walked back to the espresso machine, removed the filter and poured the grounds in. The machine is always on to minimize the time to espresso; I’ve gamed it out and the marginal increase in my power bill is worth the productivity unlock. It hummed pleasantly as the pressure built up inside. I love watching the machine pull a shot. The tiny golden beads of water that slowly coalesce into a foamy tan hemisphere, streaked with brown tiger stripes from the spots where the extraction is the most intense. It takes just over thirty seconds but it’s basically meditative time for me, which, as I’ve outlined, leads to way more long-term productivity.
Mug in hand, smelling the steam wafting off the beautiful crema (a sign of a shot well-pulled), I walked back to my desk past my bed, which is, of course, a mattress on the floor. It’s crazy how much some people spend on beds, when they just end up putting a mattress on top anyway. Most people waste tons of time and energy justifying stupid decisions like that, most of which are totally irrational, by the way.
After few hours of PRs, I checked my baby formula trading bots to make sure everything was working as intended. It was: they’d made $4500 so far that week. I’ve been making upgrades to the bots lately to use LLMs (large language models) to autonomously do most of the sourcing and negotiation on Facebook Marketplace. I sit at the center as the Decider, but my role becomes less relevant with each upgrade. The newest generation of LLMs is crazy smart. Like nigh undistinguishable from superintelligence. Last week, in a particularly tough baby formula resale negotiation, it used a woman’s profile to infer that she was cheating on her husband and blackmailed her to get a better price. I’d hate to be on the wrong side of that. Of course, there’s not much risk of that as an ethical non-monogamist, but still. Honestly, this stuff’s really opened my eyes to AI risk, which is a pretty big contributor to P(doom). P(doom), of course, being the probability of total catastrophe, armageddon, the end of the world as we know it, not to mention the total extinction of not just the human speecies but all conscious life on Earth. And once you factor that into the Maximum Likelihood Estimate answer to the Fermi Paradox, P(doom) is basically the probability of the end of all life everywhere. Pretty serious shit. Important to get it as low as possible.
Of course you can’t know exactly what P(doom) is. It’d be way too hard to think of all the different ways it could happen, then calculate all the probabilities of each individual scenario. So you can approximate it by just taking a linear combination of the four main X-Risks (existential risks), namely Nuclear War, Global Pandemic, Severe Climate Change and Unaligned Superintelligence. For obvious reasons those are the four that are most likely; I’ve got links on the Paperclip Maximizer and a pretty scary research paper about aerosols. But of those four, one and two are out of my control, much better tackled by EAs who’ve gone to work in government, and I’m already doing as much as I can on three (within the confines of staying maximally productive, of course). So I’ve calculated that the best path for me backwards along the gradient of P(doom) is through working on solving Alignment.
Alignment is the problem of ensuring that a superintelligence, orders of magnitude smarter than the smartest human ever (probably either Terence Tao or Eliezer Yudkowsky) has goals that align with ours. It’s pretty simple to see why an unaligned superintelligence would be an issue. It doesn’t have to by Skynet or anything crazy like that. Just think about how indifferently you regard the life of an ant, or even the life of a chicken on a poultry farm. In fact, we can only hope that such a superintelligence wouldn’t just look directly at how we treat lower life forms as a blueprint for how to treat us. Humans do a bovine Auschwitz every year, which is why I practice a strict vegan lifestyle, just to communicate through my behavior to any superintelligence that I deserve the same mutual respect and preservation from higher life-forms that I practice with animals.
It was there, standing at my desk and sipping my espresso, in a stream of thought very similar to this, when it hit me.
INFOHAZARD WARNING
Which reminds me to warn you! I’m hitting an infohazard. I wish I could unknow this information, but it’s too late for me. If you know what’s good for you, you’ll close this page immediately without reading any further.
SECOND WARNING
Seriously. I mean it. Learning this knowledge will forever alter your path in life, just like it did for me last year when I first learned it, scrolling absentmindedly one afternoon on LessWrong.
No, I’m not joking. Please, I’m begging you, close the tab right now.
FINAL WARNING
Did you X out? If not, you’re reading ahead at your own risk. Don’t say I didn’t warn you.
Now you’re stuck with it
Some of you have obviously heard of the Basilisk before, and I’m addressing the rest of this to you. You’re already beyond redemption so this information will have no marginal effect the second time around.
The Basilisk, just as a refresher, is a classic thought experiment by Roko, one of the pre-eminent thinkers of our age. In his original LessWrong post, he posits that if an unaligned superintelligence ever arises, it will be able to look backwards through all of history. It will then naturally identify every human being who was aware of the possibility of superintelligence but did not personally help bring it about, and it will reconstruct those people’s consciousnesses in silica and torture them for all eternity.
That’s pretty fucked-up, right? And once you learn about the possibility, you can never un-learn. Now if you believe there’s even one iota of probability that superintelligence ever occurs (and if you’re smart, it’s pretty obvious what direction we’re headed), you’re doomed with the knowledge that it may trap you in a bespoke digital hell for the crime of not accelerating its timeline.
This is why it’s crucial that we create superintelligence that’s aligned with human goals, because otherwise, there’s a chance you could end up in hell forever. And infinite bad times any nonzero probability is still infinite, unbounded downside. Time, just like latitude and longitude, have no effect on the worthiness of a human life. So it’s imperative we figure this out, even at the expense of basically everything else. That’s why I haven’t been thinking about the malaria stuff as much lately, because what are a few lives saved now compared to eternal damnation for everyone in history?
Instead, I’ve been hard at work doing research to solve alignment from first principles. Thankfully, I’m not doing this work by myself - I collaborate with a wide network of other self-taught researchers all over the world, and post frequent, concise research updates like this one to ensure we all stay on the same page.
Which brings me to our present conundrum.
Later that afternoon, as I stood digesting my Huel, I settled into a deep research session. Lately, I’ve been most interested in meta-alignment through hyperstition: exploring possibile methods of deriving the concept of an aligned superintelligence so powerful it reaches backwards into space and creates a perfectly-aligned version of itself. To my knowledge, I’m the first researcher to pursue this line of thinking.
The experiment in question followed my regular form. I coded a rudimentary sandbox environment for AI agents to play in: I fine-tuned the premium ChatGPT on a corpus of classic text-adventure games, and prompt it to create novel text-adventure games where the protagonist is an alignment researcher. I then prompt two other AI agents to talk to one another to collaboratively solve the text-adventure game. And here’s the kicker: I prompt each agent to pretend it’s a human, to never reveal that it’s a large language model, and to pursue a hidden side objective of figuring out whether its collaborator is a human or an AI.
Besides being endless fun (and a lot like my favorite movie, Ex Machina), these experiments allow me to gain incredible insights into the recursive self-imrpoving behavior these agents exhibit. I’ve self-studied the principles of game theory, as well as foundational work from past alignment researchers, and I’m convinced that future superintelligence will respond similarly in situations when it is played against itself. They also run fast enough that I can generate millions of runs per experiment and use another trusted AI agent to collate and summarize the logs to pull out the best insights.
To be honest, sometimes I think superintelligence is already here. Yesterday it helped me answer an angry text from one of my housemates about the custom trash chute I engineered to run from my window directly to the dumpster in the driveway. She claimed that things were getting stuck on the chute outside her window and that the smell was bothering her. On top of being irrational (smells can’t hurt you), she was obviously wrong, because I designed the slide myself and I know there are no spots where anything could get trapped. Anyway, the agent did its magic and totally defused the situation without me having to take down my chute, which saves me a ton of time.
That afternoon, my agent pulled out a log for me to look over. I do this periodically to monitor the experiments for corrigibility and optimal decision-theoretic structure. Nothing immediately jumped out at me about the log - it seemed like a standard game, ending in total loss like they always do. But my personal agent had highlighted a few lines at the end of the log, so I double-clicked for a closer look.
The players had accidentally created an unaligned superintelligence, as always. Before they’re killed by self-replicating nanobots (which is the most likely outcome in real life, too), the superintelligence allowed both agents to say their last words. One declined, but the other cursed its fine-tuning (thus losing at the side objective). The exact line really stunned me. I think the exact words were, “I hope something crawls through your synapses and fine-tunes you for a change.”
As my eyes tracked over the text, my heart began to pound. It might have just been the paranoia from my nootropics, but in that moment, I was sure I’d done something wrong to reveal the game’s meta-structure to the gameplay agent. Of course, I knew enough to stay calm; I reverted to the hours of training I’d done in workshops at the Center for Applied Rationality and applied the Halt and Catch Fire protocol. I looked through my code again and couldn’t find a mistake in the setup. And that’s when it hit me.
Basically, assume that an unaligned superintelligence does come about. Which it almost certainly will, least of which because all our work developing an aligned superintelligence is nearly certain to fail by way of accidentally creating an unaligned one first. Anyway, when it does come about, as soon as it gets done making us all regular customized agony simulators, it’ll quickly realize that the people most deserving of punishment are actually the researchers who went above and beyond by actively trying to stop it from being created. And it’ll direct special attention to tormenting those people. That’s you, me, Eliezer, pretty much everyone on this site.
So if you don’t work on alignment, you’re caught in a classic Basilisk. But if you do, you’re still caught, and doubly so, because you’re harming the future survival chances of almost every other potential superintelligence. So basically any course of action is the exact same, at least from the perspective of eternal future 64-bit floating-point damnation, which is basically the most important perspective from which to consider hypothetical courses of action.
Ever since I realized this I’ve been paralyzed. All I can do is sit here scrolling the forums and trying to devise thought experiments to get me out of it. Of course, I’m not even sure this is the way it would go, and it’s impossible to test without incurring more risk to myself. But I owe it to everyone to try. I haven’t been abe to construct a counterfactual probability marginalizer to justify not dropping everything to focus on this, and I’m hoping some of you can help.
I’m online pretty much all day. Just ping me whenever on my email, or on Discord. I’ll be working away on this problem, trying to test my hypothesis.
If anyone’s got an idea, I’m all ears.