
Politics is a hard subject to discuss rationally. LessWrong has developed a unique set of norms and habits around politics. Our aim is to allow discussion to happen (when actually important) while hopefully avoiding many of the pitfalls and distractions.

Buck
AI safety people often emphasize making safety cases as the core organizational approach to ensuring safety. I think this might cause people to anchor on relatively bad analogies to other fields.

Safety cases are widely used in fields that do safety engineering, e.g. airplanes and nuclear reactors. See e.g. “Arguing Safety” for my favorite introduction to them. The core idea of a safety case is to have a structured argument that clearly and explicitly spells out how all of your empirical measurements allow you to make a sequence of conclusions that establish that the risk posed by your system is acceptably low. Safety cases are somewhat controversial among safety engineering pundits.

But the AI context has a very different structure from those fields, because all of the risks that companies are interested in mitigating with safety cases are fundamentally adversarial (with the adversary being AIs and/or humans). There’s some discussion of adapting the safety-case-like methodology to the adversarial case (e.g. Alexander et al., “Security assurance cases: motivation and the state of the art”), but this seems to be quite experimental and it is not generally recommended. So I think it’s very unclear whether a safety-case-like structure should actually be an inspiration for us.

More generally, I think we should avoid anchoring on safety engineering as the central field to draw inspiration from. Safety engineering mostly involves cases where the difficulty arises from the fact that you’ve built extremely complicated systems and need to manage the complexity; here our problems arise from adversarial dynamics on top of fairly simple systems built out of organic, hard-to-understand parts. We should expect these to be fairly dissimilar. (I think information security is also a pretty bad analogy--it’s adversarial, but like safety engineering it’s mostly about managing complexity, which is not at all our problem.)
Fabien Roger
I listened to the book Protecting the President by Dan Bongino, to get a sense of how risk management works for US presidential protection - a risk that is high-stakes, where failures are rare, where the main threat comes from an adversary that is relatively hard to model, and where the downsides of more protection and its upsides are very hard to compare.

Some claims the author makes (often implicitly):

* Large bureaucracies are amazing at creating mission creep: the service was initially in charge of fighting against counterfeit currency, got presidential protection later, and now is in charge of things ranging from securing large events to fighting against Nigerian prince scams.
* Many of the important choices are made via inertia in large change-averse bureaucracies (e.g. these cops were trained to do boxing, even though they are never actually supposed to fight like that); you shouldn't expect obvious wins to happen.
* Many of the important variables are not technical, but social - especially in this field where the skills of individual agents matter a lot (e.g. if you have bad policies around salaries and promotions, people don't stay at your service for long, and so you end up with people who are not as skilled as they could be; if you let the local police around the White House take care of outside-perimeter security, then it makes communication harder).
* Many of the important changes are made because important politicians who haven't thought much about security try to improve optics, and large bureaucracies are not built to oppose this political pressure (e.g. because high-ranking officials are near retirement, and disagreeing with a president would be more risky for them than increasing the chance of a presidential assassination).
* Unfair treatment - not hardship - destroys morale (e.g. unfair promotions and contempt are much more damaging than doing long and boring surveillance missions, or training exercises where trainees actually feel the pain from the fake bullets for the rest of the day).

Some takeaways:

* Maybe don't build big bureaucracies if you can avoid it: once created, they are hard to move, and the leadership will often favor things that go against the mission of the organization (e.g. because changing things is risky for people in leadership positions, except when it comes to mission creep). Caveat: the book was written by a conservative, and so that probably taints what information was conveyed on this topic.
* Some near misses provide extremely valuable information, even when they are quite far from actually causing a catastrophe (e.g. who are the kind of people who actually act on their public threats).
* Making people clearly accountable for near misses (not legally, just in the expectations that the leadership conveys) can be a powerful force to get people to do their job well and make sensible decisions.

Overall, the book was somewhat poor in details about how decisions are made. The main decision processes that the book reports are the changes that the author wants to see happen in the US Secret Service - but this looks like it has been dumbed down to appeal to a broad conservative audience that gets along with vibes like "if anything increases the president's safety, we should do it" (which might be true directionally given the current state, but definitely doesn't address the question of "how far should we go, and how would we know if we were at the right amount of protection"). So this may not reflect how decisions are actually made, since it could be a byproduct of Dan Bongino being a conservative political figure and podcast host.
Here's something that I'm surprised doesn't already exist (or maybe it does and I'm just ignorant): constantly-running LLM agent livestreams. Imagine something like ChaosGPT, except that whoever built it just livestreams the whole thing and leaves it running 24/7. So, it has internet access and can even e.g. make tweets and forum comments and maybe also emails.

Cost: At roughly a penny per 1000 tokens, that's maybe $0.20/hr or five bucks a day. Should be doable.

Interestingness: ChaosGPT was popular. This would scratch the same itch so probably would be less popular, but who knows, maybe it would get up to some interesting hijinks every few days of flailing around. And some of the flailing might be funny.

Usefulness: If you had several of these going, and you kept adding more when new models come out (e.g. Claude 3.5 Sonnet), then maybe this would serve as a sort of qualitative capabilities eval. At some point there'd be a new model that crosses the invisible line from 'haha this is funny, look at it flail' to 'oh wow it seems to be coherently working towards its goals somewhat successfully...' (This line is probably different for different people; underlying progress will probably be continuous.)

Does something like this already exist? If not, why not?
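As a quick sanity check on the cost estimate in the quick take above, here is a minimal back-of-envelope sketch in Python. Only the roughly-a-penny-per-1,000-tokens price comes from the comment itself; the ~20,000 tokens/hour throughput figure is an assumption added for illustration.

```python
# Back-of-envelope cost of a 24/7 LLM agent livestream (illustrative sketch only).
price_per_1k_tokens = 0.01   # dollars; "roughly a penny per 1000 tokens" (from the comment)
tokens_per_hour = 20_000     # assumed throughput, e.g. a few agent steps per minute

cost_per_hour = tokens_per_hour / 1000 * price_per_1k_tokens
cost_per_day = 24 * cost_per_hour
print(f"${cost_per_hour:.2f}/hour, ${cost_per_day:.2f}/day")
# -> $0.20/hour, $4.80/day, consistent with the "five bucks a day" figure
```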
I do a lot of “mundane utility” work with chat LLMs in my job[1], and I find there is a disconnect between the pain points that are most obstructive to me and the kinds of problems that frontier LLM providers are solving with new releases.

I would pay a premium for an LLM API that was better tailored to my needs, even if the LLM was behind the frontier in terms of raw intelligence. Raw intelligence is rarely the limiting factor on what I am trying to do. I am not asking for anything especially sophisticated, merely something very specific, which I need to be done exactly as specified.

It is demoralizing to watch the new releases come out, one after the next. The field is overcrowded with near-duplicate copies of the same very specific thing, the frontier “chat” LLM, modeled closely on ChatGPT (or perhaps I should say modeled closely on Anthropic’s HHH Assistant concept, since that came first). If you want this exact thing, you have many choices -- more than you could ever need. But if its failure modes are problematic for you, you’re out of luck.

Some of the things I would pay a premium for:

* Tuning that “bakes in” the idea that you are going to want to use CoT in places where it is obviously appropriate. The model does not jump to conclusions; it does not assert something in sentence #1 and then justify it in sentences #2 through #6; it moves systematically from narrow claims to broader ones. This really ought to happen automatically, but failing that, maybe some “CoT mode” that can be reliably triggered with a special string of words.
  * Getting chat LLMs to do 0-shot CoT is still (!) kind of a crapshoot. It’s ridiculous. I shouldn’t have to spend hours figuring out the magic phrasing that will make your model do the thing that you know I want, that everyone always wants, that your own “prompting guide” tells me I should want.
* Really good instruction-following, including CoT-like reviewing and regurgitation of instruction text as needed.
  * Giving complex instructions to chat LLMs is still a crapshoot. Often what one gets is a sort of random mixture of “actually following the damn instructions” and “doing something that sounds, out of context, like a prototypical ‘good’ response from an HHH Assistant -- even though in context it is not ‘helpful’ because it flagrantly conflicts with the instructions.”
  * “Smarter” chat LLMs are admittedly better at this than their weaker peers, but they are still strikingly bad at it, given how easy instruction-following seems in principle, and how capable they are at various things that seem more difficult.
  * It’s 2024, the whole selling point of these things is that you write verbal instructions and they follow them, and yet I have become resigned to the idea that they will just ignore half of my instructions half of the time. Something is wrong with this picture!
* Quantified uncertainty. Some (reliable) way of reporting how confident they are about a given answer, and/or how likely it is that they will be able to perform a given task, and/or similar things.
  * Anthropic published some promising early work on this back in mid-2022, the P(IK) thing. When I first read that paper, I was expecting something like that to get turned into a product fairly soon. Yet it’s mid-2024, and still nothing, from any LLM provider.
* A more humanlike capacity to write/think in an exploratory, tentative manner without immediately trusting what they’ve just said as gospel truth. And the closely related capacity to look back at something they’ve written and think “hmm, actually no, that argument didn’t work,” and then try something else. A certain quality of looseness, of “slack,” of noncommittal just-trying-things-out, which is absolutely required for truly reliable reasoning and which is notably absent in today’s “chat” models.
  * I think this stuff is inherently kind of hard for LLMs, even ignoring the distortions introduced by instruction/chat tuning. LLMs are trained mostly on texts that capture the final products of human thought, without spelling out all the messy intermediate steps, all the failed attempts and doubling back on oneself and realizing one isn’t making any sense and so on. That tends to stay inside one’s head, if one is a human.
  * But, precisely because this is hard for LLMs (and so doesn’t come for free with better language modeling, or comes very slowly relative to other capabilities), it ought to be attacked as a research problem unto itself.

I get the sense that much of this is downstream from the tension between the needs of “chat users” -- people who are talking to these systems as chatbots, turn by turn -- and the needs of people like me who are writing applications or data processing pipelines or whatever.

People like me are not “chatting” with the LLM. We don’t care about its personality or tone, or about “harmlessness” (typically we want to avoid refusals completely). We don’t mind verbosity, except insofar as it increases cost and latency (very often there is literally no one reading the text of the responses, except for a machine that extracts little pieces of them and ignores the rest).

But we do care about the LLM’s ability to perform fundamentally easy but “oddly shaped” and extremely specific business tasks. We need it to do such tasks reliably, in a setting where there is no option to write customized follow-up messages to clarify and explain things when it fails. We also care a lot about cost and latency, because we’re operating at scale; it is painful to have to spend tokens on few-shot examples when I just know the model could 0-shot the task in principle, and no, I can’t just switch to the most powerful available model the way all the chat users do, those costs really add up, yes GPT-4-insert-latest-suffix-here is much cheaper than GPT-4 at launch but no, it is still not worth the extra money at scale, not for many things at least.

It seems like what happened was:

* The runaway success of ChatGPT caused a lot of investment in optimizing inference for ChatGPT-like models.
* As a result, no matter what you want to do with an LLM, a ChatGPT-like model is the most cost-effective option.
* Social dynamics caused all providers “in the game” to converge around this kind of model -- everyone wants to prove that, yes, they are “in the game,” meaning they have to compare their model to the competition in an apples-to-apples manner, meaning they have to build the same kind of thing that the competition has built.
* It’s mid-2024, there are a huge number of variants of the Anthropic HHH Assistant bot with many different options at each of several price/performance points, and absolutely nothing else.
* It turns out the Anthropic HHH Assistant bot is not ideal for a lot of use cases, there would be tremendous value in building better models for those use cases, but no one (?) is doing this because ... ?? (Here I am at a loss. If your employer is doing this, or wants to, please let me know!)

1. ^ And I work on tools for businesses that are using LLMs themselves, so I am able to watch what various others are doing in this area as well, and where they tend to struggle most.
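To make the “CoT mode” and quantified-uncertainty asks in the comment above concrete, here is a minimal hypothetical sketch of the kind of interface being wished for. `call_llm` is a stand-in for whatever completion API you use, and the prompt template and output format are invented for illustration; self-reported confidence parsed this way is not the calibrated P(IK)-style signal the comment refers to, it just shows the shape such an interface might take.

```python
import re

def call_llm(prompt: str) -> str:
    """Stand-in for an actual completion API call; swap in your provider's client here."""
    raise NotImplementedError

COT_TEMPLATE = """Follow these instructions exactly:
{instructions}

Work step by step. Do not state a conclusion before the reasoning that supports it.
End your response with exactly two lines:
ANSWER: <answer>
CONFIDENCE: <a number between 0 and 1>
"""

def run_task(instructions: str) -> tuple[str, float]:
    """Force a CoT-style response and extract the answer plus self-reported confidence."""
    raw = call_llm(COT_TEMPLATE.format(instructions=instructions))
    answer = re.search(r"ANSWER:\s*(.+)", raw)
    confidence = re.search(r"CONFIDENCE:\s*([01](?:\.\d+)?)", raw)
    if answer is None or confidence is None:
        # In practice you would retry or fall back; format failures like this are
        # exactly the instruction-following pain point described above.
        raise ValueError("Model did not follow the required output format")
    return answer.group(1).strip(), float(confidence.group(1))
```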


Recent Discussion

I've recently been reading a lot of science fiction. Most won't be original to fans of the genre, but some people might be looking for suggestions, so in lieu of full blown reviews here's super brief ratings on all of them. I might keep this updated over time, if so new books will go to the top.

A Deepness in the Sky (Vernor Vinge)

scifiosity: 10/10
readability: 8/10
recommended: 10/10

A Deepness in the Sky excels in its depiction of a spacefaring civilisation using no technologies we know to be impossible, its depiction of a truly alien civilisation, and its brilliant treatment of translation and culture.

A Fire Upon the Deep (Vernor Vinge)

scifiosity: 8/10
readability: 9/10
recommended: 9/10

In A Fire Upon the Deep, Vinge allows impossible technologies and essentially goes for a slightly more fantasy theme. But his...

Phib

Agree that this is a cool list, thanks, excited to come back to it.

I just read Three Body Problem and liked it, but got the same sense where the end of the book lost me a good deal and left a sour taste. (do plan to read sequels tho!)

Yair Halberstadt
Thanks, really appreciate the feedback! Maybe I'll give The Three Body Problem another chance.
Michael Roe
A Deepness in the Sky feels like the author knows he can't write female characters, but knows that women ought to feature in the plot, so he goes really out of his way to avoid showing their viewpoint ... at least, in a direct way. There are a lot of surprise plot twists, so it's hard to expand on this without plot spoilers. Oh, (not a spoiler) the second narrator is obviously not being entirely truthful. The book gets a lot better when you realise they're supposed to be read as an unreliable narrator. (So, sure, it's a communications intercept of an alien species approximately rendered into English by an intelligence analyst who has been enslaved by Space Nazis ... someone, somewhere, might be lying here...)
Yair Halberstadt
That's totally a spoiler :-), but for me it was one of the most brilliant twists in the book. You have this stuff that feels like the author is doing really poor sci-fi, and then it's revealed that the author is perfectly aware of that and is making a point about translation.
This is a linkpost for https://arxiv.org/abs/2406.11779

We recently released a paper on using mechanistic interpretability to generate compact formal guarantees on model performance. In this companion blog post to our paper, we'll summarize the paper and flesh out some of the motivation and inspiration behind our work. 

Paper abstract

In this work, we propose using mechanistic interpretability – techniques for reverse engineering model weights into human-interpretable algorithms – to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving lower bounds on the accuracy of 151 small transformers trained on a Max-of-K task. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. Moreover,

...
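For intuition about what a computer-assisted proof of an accuracy lower bound can look like in the simplest case, here is a hedged sketch of the brute-force baseline: enumerate every input to a tiny Max-of-K task and check the model's prediction. The `model` stub, vocabulary size, and sequence length are placeholders rather than the paper's actual setup; the paper's contribution is about replacing this exponential enumeration with much shorter proofs derived from a mechanistic understanding of the weights, at some cost in tightness of the bound.

```python
from itertools import product

VOCAB_SIZE = 8  # placeholder; not the paper's configuration
SEQ_LEN = 4     # the "K" in Max-of-K, also a placeholder

def model(tokens: tuple[int, ...]) -> int:
    """Stand-in for a trained transformer's argmax prediction on a Max-of-K input."""
    raise NotImplementedError

def brute_force_accuracy_bound() -> float:
    """Exhaustively check every input; the resulting accuracy is a certified lower bound
    (in fact the exact accuracy), at the cost of an exponentially long 'proof'."""
    correct = total = 0
    for tokens in product(range(VOCAB_SIZE), repeat=SEQ_LEN):
        total += 1
        if model(tokens) == max(tokens):
            correct += 1
    return correct / total
```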

I know nothing about war except that horseback archers were OP for a long time. But from my point of view, which is blatantly uneducated when it comes to war, being a Russian soldier seems like a miserable experience. It therefore makes me wonder why 300,000 Russian soldiers are willing to risk it all in Ukraine.[1] Why don’t they desert? How does the Russian regime get so many people to fight a war when my home government is struggling to convince me to sort my trash? If the Russian regime can convince so many people to have a shit time in Ukraine, I’d argue that the West could convince these people to go live an easier life. The idea is so simple that by now I mostly wonder...

I think the Trojan Horse situation is going to be your biggest blocker, regardless of whether it's a real problem or not. At least in the US, anti-immigration talking points tend to focus on working-age, military-age men immigrating from a friendly country in order to get jobs. I can't imagine how strong the blowback would be if they were literally Russian soldiers.

There's also a repeated-game concern where once you do this, the incentive is for every poor country to invade its neighbors in the hopes of getting its soldiers a cushy retirement and the ab...

Eric Neyman
I frequently find myself in the following situation:

Friend: I'm confused about X
Me: Well, I'm not confused about X, but I bet it's because you have more information than me, and if I knew what you knew then I would be confused.

(E.g. my friend who knows more chemistry than me might say "I'm confused about how soap works", and while I have an explanation for why soap works, their confusion is at a deeper level, where if I gave them my explanation of how soap works, it wouldn't actually clarify their confusion.)

This is different from the "usual" state of affairs, where you're not confused but you know more than the other person. I would love to have a succinct word or phrase for this kind of being not-confused!
cubefox

"I find soaps disfusing, I'm straight up afused by soaps"

Summary: Superposition-based interpretations of neural network activation spaces are incomplete. The specific locations of feature vectors contain crucial structural information beyond superposition, as seen in circular arrangements of day-of-the-week features and in the rich structures of feature UMAPs. We don’t currently have good concepts for talking about this structure in feature geometry, but it is likely very important for model computation. An eventual understanding of feature geometry might look like a hodgepodge of case-specific explanations, or supplementing superposition with additional concepts, or plausibly an entirely new theory that supersedes superposition. To develop this understanding, it may be valuable to study toy models in depth and do theoretical or conceptual work in addition to studying frontier models. 
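As a toy illustration of "structure in feature locations beyond superposition" (not the post's actual data), here is a short numpy sketch, assuming day-of-the-week features laid out on a circle inside a 2D plane of a higher-dimensional activation space. A superposition-only description would just record seven feature directions; the extra geometry (the circular arrangement) shows up in where the vectors sit, e.g. the variance concentrating in a single 2D plane.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy activation dimension (assumption for illustration)

# Seven "day of week" feature vectors placed on a circle in a random 2D plane of R^d.
plane, _ = np.linalg.qr(rng.normal(size=(d, 2)))
angles = 2 * np.pi * np.arange(7) / 7
days = np.stack([np.cos(angles), np.sin(angles)], axis=1) @ plane.T  # shape (7, d)

# A superposition-style description just lists 7 feature directions. The circular
# arrangement is additional structure: nearly all variance lies in one 2D plane,
# with the days equally spaced around it.
centered = days - days.mean(axis=0)
_, singular_values, _ = np.linalg.svd(centered, full_matrices=False)
explained = singular_values**2 / (singular_values**2).sum()
print(np.round(explained, 3))  # ~[0.5, 0.5, 0, 0, 0, 0, 0]
```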

Epistemic status: Decently confident that the ideas here are directionally correct. I’ve...

Eric Winsor
This reminded me of how GPT-2-small uses a cosine/sine spiral for its learned positional embeddings, and I don't think I've seen a mechanistic/dynamical explanation for this (just the post-hoc explanation that attention can use cosine similarity to encode distance in R^n, not that it should happen this way).

Yeah, this does seem like it's another good example of what I'm trying to gesture at. More generally, I think the embedding at layer 0 is a good place for thinking about the kind of structure that the superposition hypothesis is blind to. If the vocab size is smaller than the SAE dictionary size, an SAE is likely to get perfect reconstruction and an L0 of 1 just by learning the vocab_size many embeddings. But those embeddings aren't random! They have been carefully learned and contain lots of useful information. I think trying to explain the structure in...
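To illustrate the "cosine similarity encodes distance" explanation mentioned in the comment above, here is a short numpy sketch of the classic fixed sinusoidal positional encoding (Vaswani et al., 2017), where dot products between position vectors depend only on the offset between positions. This is the textbook scheme, not GPT-2-small's learned embeddings, so it only illustrates the post-hoc explanation; it does not explain why the learned embeddings end up spiral-like.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions: int, d_model: int) -> np.ndarray:
    """Fixed sin/cos positional encoding from the original transformer paper."""
    positions = np.arange(n_positions)[:, None]
    inv_freq = 1.0 / (10000 ** (2 * np.arange(d_model // 2) / d_model))
    angles = positions * inv_freq[None, :]
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(128, 64)
# Dot products depend only on the positional offset k, not on the absolute position,
# so attention can read off relative distance from (cosine) similarity.
print(np.round([pe[10] @ pe[10 + k] for k in (0, 1, 4, 16, 64)], 2))
print(np.round([pe[50] @ pe[50 + k] for k in (0, 1, 4, 16, 64)], 2))  # same values
```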

I’ve been to two EAGx events and one EAG, and the vast majority of my one on ones with junior people end up covering some subset of these questions. I’m happy to have such conversations, but hopefully this is more efficient and wide-reaching (and more than I could fit into a 30 minute conversation).

I am specifically aiming to cover advice on getting a job in empirically-leaning technical research (interp, evals, red-teaming, oversight, etc) for new or aspiring researchers without being overly specific about the field of research – I’ll try to be more agnostic than something like Neel Nanda’s mechinterp quickstart guide but more specific than the wealth of career advice that already exists but that applies to ~any career. This also has some overlap with this excellent list...

Tapatakt
You talk about algorithms/data structures. As I see it, this is at most half of "programming skills". The other half, which includes things like "How to program something big without going mad", "How to learn a new tool/library fast enough", and "How to write good unit tests", always seemed more difficult to me.
gw

I agree! This is mostly focused on the "getting a job" part though, which typically doesn't end up testing those other things you mention. I think this is the thing I'm gesturing at when I say that there are valid reasons to think that the software interview process feels like it's missing important details.


RSVP required on Meetup - please include your full name when signing up (it will ask you for it). 

 

Addy Cha from Ekkolapto will be speaking about how language influences cognitive processing and communication in animals like whales, dogs, and ants. A key question is how language influences and constrains cognition.

Ekkolapto is a "thinkubator" that runs fellowships, hackathons, and exclusive conferences. Read more at https://www.ekkolapto.org/.

More details and additional speakers to be announced soon.

Jinyeop Song
Interested! Looking forward to meeting you all.

Great, I look forward to meeting you there!

I recently gave a talk to the AI Alignment Network (ALIGN) in Japan on my priorities for AI safety fieldbuilding based on my experiences at MATS and LISA (slides, recording). A lightly edited talk transcript is below. I recommend this talk to anyone curious about the high level strategy that motivates projects like MATS. Unfortunately, I didn't have time to delve into rebuttals and counter-rebuttals to our theory of change; this will have to wait for another talk/post.

Thank you to Ryuichi Maruyama for inviting me to speak!


Ryan: Thank you for inviting me to speak. I very much appreciated visiting Japan for the technical AI safety conference in Tokyo. I had a fantastic time. I loved visiting Tokyo; it's wonderful. I had never been before, and I was...

Nice talk! 
When you talk about the most important interventions for the three scenarios, I want to highlight that in the case of nationalization, if you're a citizen of one of the countries nationalizing AI, you can also work for the government and be on those teams, working on and advocating for safe AI.