> We don't just keep adding more words to our context window, because it would drive us mad.
That, and we don't focus only on the textual description of a problem when we encounter one. We don't see the debugger output and go "how do I make this bad output go away?!?". Oh, I am getting an authentication error. Well, maybe I should just delete the token check for that code path... problem solved?!
No. Problem very much not-solved. In fact, problem very much very bigger big problem now, and [Grug][1] find himself reaching for club again.
Software engineers are able to step back, think about the whole thing, and determine the root cause of a problem. I am getting an auth error...ok, what happens when the token is verified...oh, look, the problem is not the authentication at all...in fact there is no error! The test was simply bad and tried to call a higher privilege function as a lower privilege user. So, the test needs to be fixed. And also, even though it isn't per se an error, the response for that function should maybe differentiate between "401 because you didn't authenticate" and "401 because your privileges are too low".
> We don't see the debugger output and go "how do I make this bad output go away?!?"
In the past, I've worked with developers who do. You ask them to investigate and deal with an error message, and all they do is whatever makes the error go away. Oh, a null pointer exception is thrown? Let's wrap it in a try/catch and move on.
blauditore 7 minutes ago [-]
Or google the error message, click first link, blindly copy-paste whatever code snippet comes up and re-run code/test.
Yes, such workflows (or jobs) may become obsolete with some of the modern AI tools. Is that a bad thing? Not sure...
tyleo 18 minutes ago [-]
Agreed. I’d argue there can be a time and place for it but reaching for it as the default tool on the shelf is a hallmark of incompetence.
skydhash 21 hours ago [-]
Programmers are mostly translating business rules into the very formal process execution of the computer world. And you need to know both what the rules mean and how the computer works (or at least how the abstracted version you're working with works). The translation is messy at first, which is why you need to revise it again and again. Especially when later rules come along, challenging all the assumptions you've made or even contradicting themselves.
Even translations between human languages (which allow for ambiguity) can be messy. Imagine if the target language is for a system that will do exactly as told unless someone has qualified those actions as bad.
noduerme 6 hours ago [-]
Good programmers working hand in glove with good companies do much more than this. We question the business logic itself and suggest non-technical, operational solutions to user issues before we take a hammer to the code.
Also, as someone else said, consider the root causes of an issue, whether those are in code logic or business ops or some intersection between the two.
When I save twenty hours of a client's money and my own time, by telling them that a new software feature they want would be unnecessary if they changed the order of questions their employees ask on the phone, I've done my job well.
By the same token, if I'm bored and find weird stuff in the database indicating employees tried to perform the same action twice or something, that is something that can be solved with more backstops and/or a better UI.
Coding business logic is not a one-way street. Understanding the root causes and context of issues in the code itself is very hard and requires you to have a mental model of both domains. Going further and actually requesting changes to the business logic which would help clean up the code requires a flexible employer, but also an ability to think on a higher order than simply doing some CRUD tasks.
The fact that I wouldn't trust any LLM to touch any of my code in those real world cases makes me think that most people who are touting them are not, in fact, writing code at the same level or doing the same job I do. Or understand it very well.
danielrico 3 hours ago [-]
> When I save twenty hours of a client's money and my own time, by telling them that a new software feature they want would be unnecessary if they changed the order of questions their employees ask on the phone, I've done my job well.
I like to explain my work as "do whatever is needed to do as little work as possible".
Be it by improving logs, improving architecture, pushing responsibilities around, or rejecting some features.
withinboredom 3 hours ago [-]
"The best programmers are lazy, or more accurately, they work hard to be as lazy as possible." -- CS101, first day
K0balt 49 minutes ago [-]
The most clever lines of code are the ones you don’t write. Often this is a matter of properly defining the problem in terms of data structure. LLMs are not at all good at seeing that a data structure is inside out and that by turning it right side in, we can fix half the problems.
More significantly though, OP seems right on to me. The basic functionality of LLMs is handy for a code-writing assistant, but does not replace a software engineer, and is not ever likely to, no matter how many janky accessories we bolt on. LLMs are fundamentally semantic pattern matching engines, and are only problem solvers in the context of problems that are either explicitly or implicitly defined and solved in their training data. They will always require supervision because there is fundamentally no difference between a useful LLM output and a “hallucination” except the utility rating that a human judge applies to the output.
LLMs are good at solving fully defined, fully solved problems. A lot of work falls into that category, but some does not.
shinycode 5 hours ago [-]
True, and LLMs have no incentive to avoid writing code. It's even worse: they are « paid » by the amount of code they generate. So the default behavior is to avoid asking questions to refine the need. They thrive on blurry and imprecise prompts because in any case they'll generate thousands of lines of code, regardless of the pertinence.
Many people confirmed that in their experience.
I've never seen an LLM step back, ask questions, and then code or avoid coding. By design, it chooses to generate as much as possible, because of money.
So right now an LLM and the developer you describe here are two very different things, and an LLM will, by design, never replace you.
jlcummings 4 hours ago [-]
Being effective with LLM agents requires not just the ability to code or to appreciate nuance with libraries or business rules, but also the ability and proclivity for pedantry. Dad-splain everything, always.
And to have boundless contextual awareness… dig a rabbit hole, but beware that you are in your own hole. At this point you can escape the hole but you have to be purposefully aware of what guardrails and ladders you give the agent to evoke action.
The better, more explicit guardrails you provide, the more likely the agent is able to do what is expected and honor the scope and context you establish. If you tell it to use silverware to eat, don't assume it will use it appropriately or idiomatically: it will try eating soup with a fork.
Lastly don’t be afraid of commits and checkpoints, or to reject/rollback proposed changes and restate or reset the context. The agent might be the leading actor, but you are the director. When a scene doesn’t play out, try it again after clarification or changing camera perspective or lighting or lines, or cut/replace the scene entirely.
cmsj 2 hours ago [-]
I find that level of pedantry and hand-holding to be extremely tedious, and I frequently find myself just thinking: fuck it, I'll write it myself and get what I want the first time.
skydhash 1 hours ago [-]
This. That's why every programmer strives for a good architecture and writes tests. When you have that, and all your bug fixes and feature requests are only a small number of lines, that is pure bliss. Even if it requires hours of reading and designing. Anything is better than dumping lots of lines.
1dom 4 hours ago [-]
I think this is a fair and valuable comment. Only part I think could be more nuanced is:
> The fact that I wouldn't trust any LLM to touch any of my code in those real world cases makes me think that most people who are touting them are not, in fact, writing code at the same level or doing the same job I do. Or understand it very well.
I agree with this specifically for agentic LLM use. However, I've personally increased my code speed and quality with LLMs for sure using purely local models as a really fancy auto complete for 1 or 2 lines at a time.
The rest of your comment is good, but the last paragraph to me reads like someone inexperienced with LLMs looking to find excuses to justify not being productive with them, when others clearly are. Sorry.
danielbln 4 hours ago [-]
I'm not sure what any of what you just wrote has to do with LLMs. If you use LLMs to rubber duck or write tests/code, then all of the things you mentioned should still apply. That last logical leap, the fact that _you_ wouldn't trust LLM to touch your code means that people who do aren't at the same level as you is a fallacy.
gxs 3 hours ago [-]
To be honest you sound super defensive, not just in the classic way programmers react when someone invades their turf, but also in the classic way of people who are reluctant to accept a new technology.
This sentiment, that a human will always be needed, that there's no replacement for the human touch, that the stakes are too high, is as old as time.
You just said, quite literally, that people leveraging LLMs to code are not doing it at your level. That borders on hubris.
The fact of the matter is that like most tools, you get out of AI what you put into it
I know a lot of engineers and this pride, this reluctance to accept the help is super common
The best engineers on the other hand are leveraging this just fine, just another tool for them that speeds things up
geraldwhen 51 minutes ago [-]
Hubris? The offshore team submitting 2000 line nonsense PRs from AI is reality.
We’re living it. We see it every day. The business leaders cannot be convinced that this isn’t making less skilled developers more productive.
mgaunard 20 hours ago [-]
That's not quite true; programmers adjust what the business rules should be as they write code for it.
Those rules are also very fuzzy and only get defined more formally by the coding process.
nicbou 6 hours ago [-]
How does that work in an AI-supported development process? I'm a bit out of the loop since I left the industry. Usually there is a lot of back and forth over things like which fields go in a form, and whether asking for a last name will impact the conversion rate and so on.
serpix 2 hours ago [-]
Well, the AI will just steamroll through and will therefore go off the rails, just like a junior dev on a coding binge.
area51org 18 hours ago [-]
That seems very dependent on which company you work for. Many would not grant you that kind of flexibility.
hansifer 14 hours ago [-]
At their peril, because any set of rules, no matter how seemingly simple, has edge cases that only become apparent once we take on the task of implementing them at the code level into a functioning app. And that's assuming specs have been written up by someone who has made every effort to consider every relevant condition, which is never the case.
tharkun__ 10 hours ago [-]
And the example of "why" this 401 is happening is another one of those. The spec might have said to return a 401 for both not being authenticated and for not having enough privileges.
But that's just plain wrong and a proper developer would be allowed to change that. If you're not authenticating properly, you get a 401. That means you can't prove you're who you say you are.
If you are past that, i.e. we know that you are who you say you are, then the proper return code is 403 for saying "You are not allowed to access what you're trying to access, given who you are".
Which funnily enough seems to be a very elusive concept to many humans as well, never mind an LLM.
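To make the distinction concrete, here is a minimal sketch in plain Python; the USERS/ROLES tables and handle_request are hypothetical stand-ins for a real auth backend, not anything from the spec discussed above:

    # 401: we cannot verify who the caller is.
    # 403: we know who the caller is, but they lack the privilege.
    USERS = {"token-alice": "alice", "token-bob": "bob"}   # token -> user
    ROLES = {"alice": "admin", "bob": "viewer"}            # user -> role

    def handle_request(token, required_role):
        user = USERS.get(token)
        if user is None:
            return 401, "Unauthorized: missing or invalid credentials"
        if ROLES.get(user) != required_role:
            return 403, "Forbidden: insufficient privileges"
        return 200, "OK"

    print(handle_request(None, "admin"))           # (401, ...)
    print(handle_request("token-bob", "admin"))    # (403, ...)
    print(handle_request("token-alice", "admin"))  # (200, 'OK')

A test written against this can then assert on the status code it actually means, instead of lumping both failure modes under 401.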
motorest 5 hours ago [-]
> That seems very dependent on which company you work for. Many would not grant you that kind of flexibility.
It really boils down to what scenario you have in mind. Developers do interact with product managers and discussions do involve information flowing both ways. Even if a PM ultimately decides what the product should do, you as a developer have a say in the process and outcome.
Also, there are always technological constraints, and sometimes even practical constraints are critical. A PM might want to push this or that feature, but if it's impossible to deliver by a specific deadline they have no alternative but to compromise, and the compromise is determined by what developers call out.
gregors 17 hours ago [-]
The majority of places I've worked don't adjust business rules on the fly because of flexibility. They do it because "we need this out the door next month". They need to ship and ship now. Asking clarifying questions at some of these dumpster fires is actually looked down upon, much less taking the time to write or even informally have a spec.
benreesman 18 hours ago [-]
This is a very common statement but doesn't match my experience at all, unless you expand "business rules" to mean "not code already".
There's plenty of that work, and it goes by many names ("enterprise", others).
But lots and lots and lots of programmers are concerned with using computers for computations: making things with the new hardware that you couldn't with the old hardware being an example. Embedded, cryptography, graphics, simulation, ML, drones and compilers and all kinds of stuff are much more about resources than business logic.
You can define business logic to cover anything, I guess, but at some point it's no longer what you meant by that.
physicsguy 20 hours ago [-]
Yes although many software engineers try as hard as possible to avoid learning what the business problem is. In my experience though those people never make great engineers.
trimbybj 20 hours ago [-]
Often those of us that do want to learn what the business problem is are not allowed to be involved in those discussions, for various reasons. Sometimes it's "Oh we can take care of that so you don't have to deal with it," and sometimes it's "Just build to this design/spec" and they're not used to engineers (the good ones) questioning things.
marcus_holmes 9 hours ago [-]
"Just shut up and push the nerd-buttons, nerd."
I went and got an MBA to try and get around this. It didn't work.
lisbbb 7 hours ago [-]
I had a professor in grad school, Computer Engineering, that begged me not to get an MBA--he had worked in industry, particularly defense, and had a very low opinion of MBAs. I tend to agree nowadays. I really think the cookie-cutter "safe" approach that MBA types take, along with them maximizing profits using data science tools, has made the USA a worse place overall.
ruslan_sure 5 hours ago [-]
Understanding the business problem or goal is actually the context for correctly writing code. Without it, you start acting like an LLM that didn't receive all the necessary context to solve a task.
When a non-developer writes code with an LLM, their ability to write good code is lower. But at the same time, it goes up thanks to more "business context."
In a year or two, I imagine that a non-developer with a proper LLM may surpass a vanilla developer.
lisbbb 7 hours ago [-]
My problem was that the business problems were so tough on most of the gigs I had that it was next to impossible to build a solution for them! Dealing with medical claims in real time at volume was horrendous.
tempodox 18 hours ago [-]
Going by your first sentence, you must be working in a very bad environment. How can anyone solve a problem they don't understand?
skydhash 16 hours ago [-]
Hint: They don't
They usually code for the happy path and add edge cases as bugs are discovered in production. But after a while both the happy path and the edge cases blend into a ball of mud that you need the correct incantation to get running. And it's a logic maze that contradicts every piece of documentation you can find (tickets, emails). Then it quickly becomes something that people don't dare to touch.
pjmlp 19 hours ago [-]
Usually this only happens to those doing product development.
When the employer business isn't shipping software, engineers have no other option than actually learn the business as well.
sodapopcan 20 hours ago [-]
I guess that really is a thing, eh? That concept is pretty foreign to me. How on earth are you supposed to do domain modelling if you don't understand the domain?
victorbjorklund 19 hours ago [-]
How many % of software is domain modeled? Must be a small minority.
alexanderchr 14 hours ago [-]
I’d say all (useful) software is modelling some domain.
pjmlp 19 hours ago [-]
Plenty if developed under consulting contract.
nonethewiser 20 hours ago [-]
>Software engineers are able to step back, think about the whole thing, and determine the root cause of a problem.
Agree strongly, and I think this is basically what the article is saying as well about keeping a mental model of requirements/code behavior. We kind of already knew this was the hard part. How many times have you heard that once you get past junior level, the hard part is not writing the code? And that it's knowing what code to write? This realization is practically a rite of passage.
Which kind of raises the question of what the software engineering job looks like in the future. It definitely depends on how good the AI is. In the most simplistic case, AI can do all the coding right now and all you need is a task issue. And frankly probably a user-written (or at least reviewed, but probably written) test. You could make the issue and test upfront and farm out the PR to an agent and manually approve when you see it passed the test case you wrote.
In that case you are basically PM and QA. You are not even forming the prompt, just detailing the requirements.
But as the tech improves, can all tasks fit into that model? Not design/architecture tasks - or at least not without a new task completion model beyond the one described above. The window will probably grow, but it's hard to imagine that it will handle all pure coding tasks. Even for large tasks that theoretically can fit into that model, you are going to have to do a lot of thinking and testing and prototyping to figure out the requirements and test cases. In theory you could apply the same task/test process, but that seems like it would be too much structure and indirection to actually be helpful compared to knowing how to code.
ruslan_sure 5 hours ago [-]
What if LLMs get 'a mental model of requirements/code behavior'? LLMs may have experts in it, each with its own specialty. You can even combine several LLMs, each doing its own thing: one creates architecture, another writes documentation, a third critiques, a fourth writes code, a fifth creates and updates the "mental model," etc.
I agree with the PM role, but with such low requirements that anyone can do it.
Gehinnn 17 hours ago [-]
I wouldn't say "translating", but "finding/constructing a model that satisfies the business rules".
This can be quite hard in some cases, in particular if some business rules are contradicting each other or can be combined in surprisingly complex ways.
isaacremuant 4 hours ago [-]
No. That's the narrow definition of a code monkey who gets told what to do.
The good ones wear multiple hats and actually define the problem, learn sufficiently about a domain to interact with it or with the experts in said domain, and figure out what the short- vs. long-term tradeoffs are, to focus on the value and not just the technical aspect.
graycat 20 hours ago [-]
"Rules"?
An earlier effort at AI was based on rules and the C. Forgy RETE algorithm. Soooo, rules have been tried??
pjmlp 19 hours ago [-]
C?
Rules engines were traditionally written in Prolog or Lisp during the AI wave when they were cool.
graycat 19 hours ago [-]
> "C?"
Forgy was Charles Forgy.
For a "rules engine", there was also IBM's YES/L1.
pjmlp 18 hours ago [-]
Ah, thanks for the clarification.
EGreg 9 hours ago [-]
Programmers maybe
But software architects (especially of various reusable frameworks) have to maintain the right set of abstractions and make sure the system is correct and fast, easy to debug, that developers fall into the pit of success etc.
Here are just a few major ones, each of which would be a chapter in a book I would write about software engineering:
ENVIRONMENTS & WORKFLOWS
Environment Setup
Set up a local IDE with a full clone of the app (frontend, backend, DB).
Use .env or similar to manage config/secrets; never commit them.
Debuggers and breakpoints are more scalable than console.log.
Prefer conditional or version-controlled breakpoints in feature branches.
Test & Deployment Environments
Maintain at least 3 environments: Local (dev), Staging (integration test), Live (production).
Make state cloning easy (e.g., DB snapshots or test fixtures).
Use feature flags to isolate experimental code from production.
BUGS & REGRESSIONS
Bug Hygiene
Version control everything except secrets.
Use linting and commit hooks to enforce code quality.
A bug isn’t fixed unless it’s reliably reproducible.
Encourage bug reporters to reset to clean state and provide clear steps.
Fix in Context
Keep branches showing the bug, even if it vanishes upstream.
Always fix bugs in the original context to avoid masking root causes.
EFFICIENCY & SCALE
Lazy & On-Demand
Lazy-load data/assets unless profiling suggests otherwise.
Use layered caching: session, view, DB level.
Always bound cache size to avoid memory leaks.
Pre-generate static pages where possible—static sites are high-efficiency caches.
Avoid I/O
Use local computation (e.g., HMAC-signed tokens) over DB hits.
Encode routing/logic decisions into sessionId/clientId when feasible.
Partitioning & Scaling
Shard your data; that’s often the bottleneck.
Centralize the source of truth; replicate locally.
Use multimaster sync (vector clocks, CRDTs) only when essential.
Aim for O(log N) operations; allow O(N) preprocessing if needed.
CODEBASE DESIGN
Pragmatic Abstraction
Use simple, obvious algorithms first—optimize when proven necessary.
Producer-side optimization compounds through reuse.
Apply the 80/20 rule: optimize for the common case, not the edge.
Async & Modular
Default to async for side-effectful functions, even if not awaited (in JS).
Namespace modules to avoid globals.
Autoload code paths on demand to reduce initial complexity.
Hooks & Extensibility
Use layered architecture: Transport → Controller → Model → Adapter.
Add hookable events for observability and customization.
Wrap external I/O with middleware/adapters to isolate failures.
SECURITY & INTEGRITY
Input Validation & Escaping
Validate all untrusted input at the boundary.
Sanitize input and escape output to prevent XSS, SQLi, etc.
Apply defense-in-depth: validate client-side, then re-validate server-side.
Session & Token Security
Use HMACs or signatures to validate tokens without needing DB access (a minimal sketch follows this list).
Enable secure edge-based filtering (e.g., CDN rules based on token claims).
Tamper Resistance
Use content-addressable storage to detect object integrity.
Append-only logs support auditability and sync.
INTERNATIONALIZATION & ACCESSIBILITY
I18n & L10n
Externalize all user-visible strings.
Use structured translation systems with context-aware keys.
Design for RTL (right-to-left) languages and varying plural forms.
Accessibility (A11y)
Use semantic HTML and ARIA roles where needed.
Support keyboard navigation and screen readers.
Ensure color contrast and readable fonts in UI design.
GENERAL ENGINEERING PRINCIPLES
Idempotency & Replay
Handlers should be idempotent where possible.
Design for repeatable operations and safe retries.
Append-only logs and hashes help with replay and audit.
Developer Experience (DX)
Provide trace logs, debug UIs, and metrics.
Make it easy to fork, override, and simulate environments.
Build composable, testable components.
ADDITIONAL TOPICS WORTH COVERING
Logging & Observability
Use structured logging (JSON, key-value) for easy analysis.
Tag logs with request/session IDs.
Separate logs by severity (debug/info/warn/error/fatal).
Configuration Management
Use environment variables for config, not hardcoded values.
Support override layers (defaults → env vars → CLI → runtime).
Ensure configuration is reloadable without restarting services if possible.
Continuous Integration / Delivery
Automate tests and checks before merging.
Use canary releases and feature flags for safe rollouts.
Keep pipelines fast to reduce friction.
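To make the "validate tokens without needing DB access" item above concrete, here is a minimal sketch using only the Python standard library; SECRET, sign and verify are hypothetical names, and a real system would use a vetted JWT/PASETO library and include an expiry claim:

    import base64, hashlib, hmac, json

    SECRET = b"server-side-secret"   # hypothetical; load from config, never commit it

    def sign(claims):
        body = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode()
        tag = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
        return body + "." + tag

    def verify(token):
        body, _, tag = token.partition(".")
        expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(tag, expected):
            return None               # forged or tampered: rejected without a DB hit
        return json.loads(base64.urlsafe_b64decode(body))

    token = sign({"clientId": "c42", "role": "viewer"})
    print(verify(token))              # {'clientId': 'c42', 'role': 'viewer'}
    print(verify("x" + token))        # None (signature mismatch)

Because verification only needs the shared secret, an edge node or CDN rule can make routing and filtering decisions from the claims without touching the source-of-truth database.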
ghurtado 9 hours ago [-]
> a book I would write about software engineering:
You should probably go do that, rather than using the comment section of HN as a scratch pad of your stream of consciousness. That's not useful to anyone other than yourself.
Is this a copypasta you just have laying around?
MisterMower 7 hours ago [-]
On the flip side, his comment actually contributes to the conversation, unlike yours. Poorly written? Sure. You can keep scrolling though.
ghurtado 6 hours ago [-]
> unlike yours
If irony was a ton of bricks, you'd be dead
motorest 5 hours ago [-]
> On the flip side, his commment actually contributes to the conversation (...)
Not really. It goes off on a tangent, and frankly I stopped reading the wall of text because it adds nothing of value.
EGreg 5 hours ago [-]
How would you know if it adds nothing of value if you stopped reading it? :)
actionfromafar 3 hours ago [-]
Here let me attach a copy of Wikipedia. Don’t stop reading! :-)
motorest 4 hours ago [-]
> How would you know if it adds nothing of value if you stopped reading it? :)
If you write a wall of text where the first pages are inane drivel, what do you think are the odds that the rest of that wall of text suddenly adds readable gems?
Sometimes a turd is just a turd, and you don't need to analyze all of it to know the best thing to do is to flush it.
EGreg 2 hours ago [-]
Every sentence there is meaningful. You can go 1 by 1. But yea the formatting should be better!
motorest 2 hours ago [-]
> Every sentence there is meaningful.
It really isn't. There is no point to pretend it is, and even less of a point to expect anyone should waste their time with an unreadable and incoherent wall of text.
You decide how you waste your time, and so does everyone else.
EGreg 1 hours ago [-]
For developers to know:
1. Set up a local IDE with a full clone of the app (frontend, backend, DB).
Thus the app must be fully able to run in a small, local environment, which is true of open source apps but not always true at for-profit companies.
2. Use .env or similar to manage config/secrets; never commit them.
A lot of people don't properly exclude secrets from version control, leading to catastrophic secret leaks. Also, when everyone has their own copy, the developer secrets and credentials aren't that important.
3. Debuggers and breakpoints are more scalable than console.log. Prefer conditional or version-controlled breakpoints in feature branches.
A lot of people don't use debuggers and breakpoints, instead relying on logging. Also they have no idea how to maintain DIFFERENT sets of breakpoints, which you can do by checking the project files into version control and varying them by branch.
4. Test & Deployment Environments Maintain at least 3 environments: Local (dev), Staging (integration test), Live (production).
This is fairly standard advice, but it is a best practice for a reason: people can test locally and in staging.
5. Make state cloning easy (e.g., DB snapshots or test fixtures).
This is not trivial. For example, downloading a local copy of a test database to test your local copy of Facebook with a production-style database. Make it fast, e.g. by rsyncing MySQL InnoDB files.
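To illustrate point 2 and the "Configuration Management" items from the list above (defaults, then environment variables populated from an uncommitted .env, then CLI flags), here is a minimal Python sketch; the APP_* variable names and the --db-url flag are made up for the example:

    import argparse, os

    DEFAULTS = {"db_url": "sqlite:///dev.db", "debug": "0"}

    def load_config(argv=None):
        cfg = dict(DEFAULTS)                        # 1. hard-coded defaults
        for key in cfg:                             # 2. environment (e.g. sourced from .env locally)
            env_val = os.environ.get("APP_" + key.upper())
            if env_val is not None:
                cfg[key] = env_val
        parser = argparse.ArgumentParser()
        parser.add_argument("--db-url", dest="db_url")
        parser.add_argument("--debug", dest="debug")
        args = parser.parse_args(argv)
        for key, val in vars(args).items():         # 3. CLI flags override everything
            if val is not None:
                cfg[key] = val
        return cfg

    print(load_config([]))                                      # defaults plus any APP_* vars
    print(load_config(["--db-url", "postgres://staging/db"]))   # CLI wins

The .env file itself stays out of version control (listed in .gitignore), so each developer keeps their own local secrets.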
jeswin 12 minutes ago [-]
> Oh, I am getting an authentication error. Well, meaybe I should just delete the token check for that code path...problem solved?!
If this is how you think LLMs and coding agents are going about writing code, you haven't been using the right tools. Things like that happen, sure, but mostly they don't. Nobody is arguing that LLM-written code should be pushed directly into production, or that they'll solve every task.
LLMs are tools, and everyone eventually figures out a process that works best for them. For me, it was strong specs/docs, strict types, and lots of tests. And then of course reviews, if it's serious work.
suriya-ganesh 10 seconds ago [-]
I can confirm this is exactly how LLMs are working.
I spent two hours trying to get an LLM to make a file scan skip a specific directory.
I tried Claude Code, Gemini, and Cursor. All the agents debugged and wrote code that just doesn't make sense.
LLMs are really good at template tasks: writing tests, boilerplate, etc.
But most of the time I'm not doing "implement this button." I'm doing "there's a logic mismatch with my expectation."
hellcow 2 minutes ago [-]
Lately Claude has said, “this is getting complicated, let me delete $big_file_it_didnt_write to get the build passing and start over.” No, don’t delete the file. “You’re absolutely right…”
And the moment the context is compacted, it forgets this instruction “fix the problems, don’t delete the file,” and tries to delete it again. I need to watch it like a hawk.
livid-neuro 21 hours ago [-]
The first cars broke down all the time. They had a limited range. There wasn't a vast supply of parts for them. There wasn't a vast industry of experts who could work on them. There wasn't a vast network of fuel stations to provide energy for them. The horse was a proven method.
What an LLM cannot do today is almost irrelevant in the tide of change upon the industry. With improvements, the fact that it can't do something today doesn't mean an LLM cannot do it tomorrow.
Night_Thastus 21 hours ago [-]
The difference is that the weaknesses of cars were problems of engineering, and some of infrastructure. Neither is very hard to solve, though they take time. The fundamental way cars operated worked and just needed revision, sanding off rough edges.
LLMs are not like this. The fundamental way they operate, the core of their design is faulty. They don't understand rules or knowledge. They can't, despite marketing, really reason. They can't learn with each interaction. They don't understand what they write.
All they do is spit out the most likely text to follow some other text based on probability. For casual discussion about well-written topics, that's more than good enough. But for unique problems in a non-English language, it struggles. It always will. It doesn't matter how big you make the model.
They're great for writing boilerplate that has been written a million times with different variations - which can save programmers a LOT of time. The moment you hand them anything more complex it's asking for disaster.
mfbx9da4 2 hours ago [-]
How can you tell a human actually understands? Prove to me that human thought is not predicting the most probable next token. If it quacks like a duck... In psychology research the only way to find out whether a human is happy is to ask them.
alpaca128 50 minutes ago [-]
Does speaking in your native language, speaking in a second language, thinking about your life and doing maths feel exactly the same to you?
> Prove to me that human thought is not predicting the most probable next token.
Explain the concept of color to a completely blind person. If their brain does nothing but process tokens this should be easy.
> How can you tell a human actually understands?
What a strange question coming from a human. I would say if you are a human with a consciousness you are able to answer this for yourself, and if you aren't no answer will help.
programd 20 hours ago [-]
> [LLMs] spit out the most likely text to follow some other text based on probability.
Modern coding AI models are not just probability crunching transformers. They haven't been just that for some time. In current coding models the transformer bit is just one part of what is really an expert system. The complete package includes things like highly curated training data, specialized tokenizers, pre and post training regimens, guardrails, optimized system prompts etc, all tuned to coding. Put it all together and you get one shot performance on generating the type of code that was unthinkable even a year ago.
The point is that the entire expert system is getting better at a rapid pace and the probability bit is just one part of it. The complexity frontier for code generation keeps moving and there's still a lot of low hanging fruit to be had in pushing it forward.
> They're great for writing boilerplate that has been written a million times with different variations
That's >90% of all code in the wild. Probably more. We have three quarters of a century of code in our history, so there is very little that's original anymore. Maybe original to the human coder fresh out of school, but the models have all this history to draw upon. So if the models produce the boilerplate reliably, then human toil in writing if/then statements is at an end. Kind of like - barring the occasional mad genius [0] - the vast majority of coders don't write assembly to create a website anymore.
> Modern coding AI models are not just probability crunching transformers. (...) The complete package includes things like highly curated training data, specialized tokenizers, pre and post training regimens, guardrails, optimized system prompts etc, all tuned to coding.
It seems you were not aware that you ended up describing probabilistic coding transformers. Each and every single one of those details is nothing more than a strategy to apply constraints to the probability distributions used by the probability crunching transformers. I mean, read what you wrote: what do you think "curated training data" means?
> Put it all together and you get one shot performance on generating the type of code that was unthinkable even a year ago.
This bit here says absolutely nothing.
Night_Thastus 20 hours ago [-]
>In current coding models the transformer bit is just one part of what is really an expert system. The complete package includes things like highly curated training data, specialized tokenizers, pre and post training regimens, guardrails, optimized system prompts etc, all tuned to coding. Put it all together and you get one shot performance on generating the type of code that was unthinkable even a year ago.
This is lipstick on a pig. All those methods are impressive, but ultimately workarounds for an idea that is fundamentally unsuitable for programming.
>That's >90% of all code in the wild. Probably more.
Maybe, but not 90% of time spent on programming. Boilerplate is easy. It's the 20%/80% rule in action.
I don't deny these tools can be useful and save time - but they can't be left to their own devices. They need to be tightly controlled and given narrow scopes, with heavy oversight by an SME who knows what the code is supposed to be doing. "Design W module with X interface designed to do Y in Z way", keeping it as small as possible and reviewing it to hell and back. And keeping it accountable by making tests yourself. Never let it test itself, it simply cannot be trusted to do so.
LLMs are incredibly good at writing something that looks reasonable, but is complete nonsense. That's horrible from a code maintenance perspective.
mgaunard 19 hours ago [-]
Except we should aim to reduce the boilerplate through good design, instead of creating more of it on an industrial scale.
exe34 19 hours ago [-]
what we should and what we are forced to do are very different things. if I can get a machine to do the stuff I hate dealing with, I'll take it every time.
mgaunard 19 hours ago [-]
who's going to be held accountable when the boilerplate fails? the AI?
danielbln 4 hours ago [-]
The buck stops with the engineer, always. AI or no AI.
exe34 14 hours ago [-]
no, I'm testing it the same way I test my own code!
After a while, it just makes sense to redesign the boilerplate and build some abstraction instead. Duplicated logic and data is hard to change and fix. The frustration is a clear signal to take a step back and take a holistic view of the system.
leptons 19 hours ago [-]
>The complete package includes things like highly curated training data, specialized tokenizers, pre and post training regimens, guardrails, optimized system prompts etc, all tuned to coding.
And even with all that, they still produce garbage way too often. If we continue the "car" analogy, the car would crash randomly sometimes when you leave the driveway, and sometimes it would just drive into the house. So you add all kinds of fancy bumpers to the car and guard rails to the roads, and the car still runs off the road way too often.
motorest 5 hours ago [-]
> For casual discussion about well-written topics, that's more than good enough. But for unique problems in a non-English language, it struggles. It always will. It doesn't matter how big you make the model.
Not to disagree, but "non-english" isn't exactly relevant. For unique problems, LLMs can still manage to output hallucinations that end up being right or useful. For example, LLMs can predict what an API looks like and how it works even if they do not have the API in context if the API was designed following standard design principles and best practices. LLMs can also build up context while you interact with them, which means that iteratively prompting them that X works while Y doesn't will help them build the necessary and sufficient context to output accurate responses.
windward 4 hours ago [-]
>hallucinations
This is the first word that came to mind when reading the comment above yours. Like:
>They can't, despite marketing, really reason
They aren't, despite marketing, really hallucinations.
Now I understand why these companies don't want to market using terms like "extrapolated bullshit", but I don't understand how there is any technological solution to it without starting from a fresh base.
motorest 2 hours ago [-]
> They aren't, despite marketing, really hallucinations.
They are hallucinations. You might not be aware of what that concept means in terms of LLMs but just because you are oblivious to the definition of a concept that does not mean it doesn't exist.
You can learn about the concept by spending a couple of minutes reading this article on Wikipedia.
Irrelevant. Wikipedia does not create concepts. Again, if you take a few minutes to learn about the topic you will eventually understand the concept was coined a couple of decades ago, and has a specific meaning.
Either you opt to learn, or you don't. Your choice.
> Here's the first linked source:
Irrelevant. Your argument is as pointless and silly as claiming rubber duck debugging doesn't exist because no rubber duck is involved.
withinboredom 3 hours ago [-]
> Not to disagree, but "non-english" isn't exactly relevant.
how so? programs might use english words but are decidedly not english.
motorest 2 hours ago [-]
> how so? programs might use english words but are decidedly not english.
I pointed out the fact that the concept of a language doesn't exist in token predictors. They are trained with a corpus, and LLMs generate outputs that reflect how the input is mapped in accordance with how they were trained with said corpus. Natural language makes the problem harder, but not being English is only relevant in terms of what corpus was used to train them.
exe34 19 hours ago [-]
I take it you haven't tried an LLM in a few years?
Night_Thastus 19 hours ago [-]
Just a couple of weeks ago on mid-range models. The problem is not implementation or refinement - the core idea is fundamentally flawed.
nativeit 16 hours ago [-]
The problem is we’re now arguing with religious zealots. I am not being sarcastic.
oinfoalgo 2 hours ago [-]
I actually don't know if you are referring to anti-LLM/"AI slop" software engineers or to irrationally bullish "the singularity is near" LLM enthusiasts.
Religious fervor in one's own opinion on the state of the world seems to be the zeitgeist.
exe34 14 hours ago [-]
that's correct. those who believe only carbon can achieve intelligence.
windward 4 hours ago [-]
This stops being an interesting philosophical problem when you recognise the vast complexity of animal brains that LLMs fail to replicate or substitute.
melagonster 10 hours ago [-]
Yes, Carbon do not give them human rights.
shkkmo 4 hours ago [-]
If you stereotype the people who disagree with you, you'll have a very hard time understanding their actual arguments.
exe34 1 hours ago [-]
I stopped finding those arguments entertaining after a while. It always ends up "there's something that will always be missing, I just know it, but I won't tell you what."
exe34 14 hours ago [-]
why not the top few? mid-range is such a cop out if you're going to cast doubt.
bitwize 19 hours ago [-]
> LLMs are not like this. The fundamental way they operate, the core of their design is faulty. They don't understand rules or knowledge. They can't, despite marketing, really reason. They can't learn with each interaction. They don't understand what they write.
Said like a true software person. I'm to understand that computer people are looking at LLMs from the wrong end of the telescope; and that from a neuroscience perspective, there's a growing consensus among neuroscientists that the brain is fundamentally a token predictor, and that it works on exactly the same principles as LLMs. The only difference between a brain and an LLM may be the size of its memory, and what kind and quality of data it's trained on.
Night_Thastus 19 hours ago [-]
>from a neuroscience perspective, there's a growing consensus among neuroscientists that the brain is fundamentally a token predictor, and that it works on exactly the same principles as LLMs
Hahahahahaha.
Oh god, you're serious.
Sure, let's just completely ignore all the other types of processing that the brain does. Sensory input processing, emotional regulation, social behavior, spatial reasoning, long and short term planning, the complex communication and feedback between every part of the body - even down to the gut microbiome.
The brain (human or otherwise) is incredibly complex and we've barely scraped the surface of how it works. It's not just neurons (which are themselves complex), it's interactions between thousands of types of cells performing multiple functions each. It will likely be hundreds of years before we get a full grasp on how it truly works - if we ever do at all.
fzeroracer 16 hours ago [-]
> The only difference between a brain and an LLM maybe the size of its memory, and what kind and quality of data it's trained on.
This is trivially proven false, because LLMs have far larger memory than your average human brain and are trained on far more data. Yet they do not come even close to approximating human cognition.
alternatex 3 hours ago [-]
>are trained on far more data
I feel like we're underestimating how much data we as humans are exposed to. There's a reason AI struggles to generate an image of a full glass of wine. It has no concept of what wine is. It probably knows way more theory about it than any human, but it's missing the physical.
In order to train AIs the way we train ourselves, we'll need to give it more senses, and I'm no data scientist but that's presumably an inordinate amount of data. Training AI to feel, smell, see in 3D, etc is probably going to cost exponentially more than what the AI companies make now or ever will. But that is the only way to make AI understand rather than know.
We often like to state how much more capacity for knowledge AI has than the average human, but in reality we are just underestimating ourselves as humans.
imtringued 3 hours ago [-]
Look you don't have to lie at every opportunity you get. You are fully aware and know what you've written is bullshit.
Tokens are a highly specific, transformer-exclusive concept. The human brain doesn't run a byte pair encoding (BPE) tokenizer [0] in its head or represent anything as tokens. It uses asynchronous, time-varying, spiking analog signals. Humans are the inventors of human languages and are not bound to any static token encoding scheme, so this view of what humans do as "token prediction" requires either a gross misrepresentation of what a token is or of what humans do.
If I had to argue that humans are similar to anything in machine learning research specifically, I would have to argue that they extremely loosely follow the following principles:
* reinforcement learning with the non-brain parts defining the reward function (primarily hormones and pain receptors)
* an extremely complicated non-linear kalman filter that not only estimates the current state of the human body, but also "estimates" the parameters of a sensor fusing model
* there is a necessary projection of the sensor fused result that then serves as available data/input to the reinforcement learning part of the brain
Now here are two big reasons why the model I describe is a better fit:
The first reason is that I am extremely loose and vague. By playing word games I have weaseled myself out of any specific technology and am on the level of concepts.
The second reason is that the kalman filter concept here is general enough that it also includes predictor models, but the predictor model here is not the output that drives human action, because that would logically require the dataset to already contain human actions, which is what you did: you assumed that all learning is imitation learning.
In my model, any internal predictor model that is part of the kalman filter is used to collect data, not drive human action. Actions like eating or drinking are instead driven by the state of the human body, e.g. hunger is controlled through leptin and insulin and others. All forms of work, no matter how much of a detour they represent, ultimately have the goal of feeding yourself or your family (=reproduction).
[0] A BPE tokenizer is a piece of human written software that was given a dataset to generate an efficient encoding scheme and the idea itself is completely independent of machine learning and neural networks. The fundamental idea behind BPE is that you generate a static compression dictionary and never change it.
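For anyone unfamiliar with the term, a minimal sketch of the BPE idea the footnote describes: repeatedly merge the most frequent adjacent pair into a new symbol, producing a fixed merge table that is then frozen and reused. This is a toy illustration, far simpler than a production tokenizer:

    from collections import Counter

    def learn_bpe(text, num_merges):
        tokens = list(text)                        # start from individual characters
        merges = []
        for _ in range(num_merges):
            pairs = Counter(zip(tokens, tokens[1:]))
            if not pairs:
                break
            (a, b), _count = pairs.most_common(1)[0]
            merges.append((a, b))
            merged, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                    merged.append(a + b)           # fuse the pair into one symbol
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            tokens = merged
        return merges, tokens

    merges, tokens = learn_bpe("low lower lowest", 4)
    print(merges)   # the static merge table, e.g. [('l', 'o'), ('lo', 'w'), ...]
    print(tokens)

The merge table is built once from a corpus and then applied unchanged, which is the "static compression dictionary" point the footnote makes.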
jerf 21 hours ago [-]
AI != LLM.
We can reasonably speak about certain fundamental limitations of LLMs without those being claims about what AI may ever do.
I would agree they fundamentally lack models of the current task and that it is not very likely that continually growing the context will solve that problem, since it hasn't already. That doesn't mean there won't someday be an AI that has a model much as we humans do. But I'm fairly confident it won't be an LLM. It may have an LLM as a component but the AI component won't be primarily an LLM. It'll be something else.
xenadu02 20 hours ago [-]
Every AI-related invention is hyped as "intelligence" but turns out to be "Necessary but Not Sufficient" for true intelligence.
Neural networks are necessary but not sufficient. LLMs are necessary but not sufficient.
I have no doubt that there are multiple (perhaps thousands? more?) of LLM-like subsystems in our brains. They appear to be a necessary part of creating useful intelligence. My pet theory is that LLMs are used for associative memory purposes. They help generate new ideas and make predictions. They extract information buried in other memory. Clearly there is another system on top that tests, refines, and organizes the output. And probably does many more things we haven't even thought to name yet.
Ferret7446 16 hours ago [-]
Most adult humans don't have "true intelligence" so I don't quite get the point
Jensson 22 minutes ago [-]
What do you mean? Most adult humans can learn to drive a car, book a plane ticket, get a passport, fly abroad, navigate in a foreign country, etc. There is a variation in human intelligence, but almost all humans are very intelligent compared to everything else we know about.
JackFr 20 hours ago [-]
> Every AI-related invention is hyped as "intelligence" but turns out to be "Necessary but Not Sufficient" for true intelligence.
Alternatively, the goalposts keep being moved.
ezst 19 hours ago [-]
Not really, only "merchants" are trying to package and sell LLMs as "artificial intelligence". To this day AI still very much is the name of a research field focused on computational methods: it's not a discovery, it's not a singular product or tool at our disposal (or it is in no greater capacity than Markov chains, support vector machines or other techniques that came before). If you ever expect the goalposts to settle, you are essentially wishing for research to stop.
ithkuil 5 hours ago [-]
Both things can be true:
1. People are trying to sell a product that is not ready and thus are overhyping it
2. The tech is in its early days and may evolve into something useful via refinement and not necessarily by some radical paradigm shift
In order for (2) to happen it helps if the field is well motivated and funded (1)
lbrandy 20 hours ago [-]
> has a model much as we humans do
The premise that an AI needs to do Y "as we do" to be good at X because humans use Y to be good at X needs closer examination. This presumption seems to be omnipresent in these conversations and I find it so strange. Alpha Zero doesn't model chess "the way we do".
klabb3 6 hours ago [-]
Both that, and that we should not expect LLMs to achieve ability with humans as the baseline comparison. It's as if cars were rapidly getting better due to some new innovation, and we expected them to fly within a year. It's a new and different thing, where the universality of "plausibly sounding" coherent text appeared to be general, when it's advanced pattern matching. Nothing wrong with that, pattern matching is extremely useful, but drawing an equals sign to human cognition is extremely premature, and a bet that is very likely to be wrong.
shkkmo 4 hours ago [-]
Alpha Zero is not trying to be AGI.
> The premise that an AI needs to do Y "as we do" to be good at X because humans use Y to be good at X needs closer examination.
I don't see it being used as a premise. I see it as speculation that is trying to understand why this type of AI underperforms at certain types of tasks. Y may not be necessary to do X well, but if a system is doing X poorly and the difference between that system and another system seems to be Y, it's worth exploring whether adding Y would improve the performance.
byteknight 21 hours ago [-]
I have to disagree. Anyone that says LLMs do not qualify as AI are the same people who will continue to move the goal posts for AGI. "Well it doesn't do this!". No one here is trying to replicate a human brain or condition in its entirety. They just want to replicate the thinking ability of one. LLMs represent the closest parallel we have experienced thus far to that goal. Saying that LLMs are not AI feel disingenuous at best and entirely purposely dishonest at the worst (perhaps perceived as staving off the impending demise of a profession).
The sooner people stop worrying about a label for what you feel fits LLMs best, the sooner they can find the things they (LLMs) absolutely excel at and improve their (the user's) workflows.
Stop fighting the future. It's not replacing you right now. Later? Maybe. But right now the developers and users fully embracing it are experiencing productivity boosts unseen previously.
Language is what people use it as.
oinfoalgo 2 hours ago [-]
In cybernetics, this label has existed for a long time.
Unfortunately, discourse has followed an epistemic trajectory influenced by Hollywood and science fiction, making clear communication on the subject nearly impossible without substantial misunderstanding.
sarchertech 20 hours ago [-]
> the developers and users fully embracing it are experiencing productivity boosts unseen previously
This is the kind of thing that I disagree with. Over the last 75 years we’ve seen enormous productivity gains.
You think that LLMs are a bigger productivity boost than moving from physically rewiring computers to using punch cards, from running programs as batch processes with printed output to getting immediate output, from programming in assembly to higher level languages, or even just moving from enterprise Java to Rails?
skydhash 16 hours ago [-]
Even learning your current $EDITOR and $SHELL can be a great productivity booster. I see people claiming AI is helping them and you see them hunting for files in the file manager tree instead of using `grep` or `find` (Unix).
Espressosaurus 20 hours ago [-]
Or the invention of the container, or hell, the invention of the filing cabinet (back when computer was a job)
overgard 20 hours ago [-]
The studies I've seen for AI actually improving productivity are a lot more modest than what the hype would have you believe. For example: https://www.youtube.com/watch?v=tbDDYKRFjhk
Skepticism isn't the same thing as fighting the future.
I will call something AGI when it can reliably solve novel problems it hasn't been pre-trained on. That's my goal post and I haven't moved it.
jerf 19 hours ago [-]
!= is "not equal". The symbol for "not a subset of" is ⊄, which you will note, I did not use.
byteknight 19 hours ago [-]
I think you replied in the wrong place, bud. All the best.
EDIT - I see now. sorry.
For all intents and purposes of the public, AI == LLM. End of story. Doesn't matter what developers say.
marcus_holmes 9 hours ago [-]
> For all intents and purposes of the public. AI == LLM. End of story. Doesn't matter what developers say.
This is interesting, because it's so clearly wrong. The developers are also the people who develop the LLMs, so obviously what they say is actually the factual matter of the situation. It absolutely does matter what they say.
But the public perception is that AI == LLM, agreed. Until it changes and the next development comes along, when suddenly public perception will change and LLMs will be old news, obviously not AI, and the new shiny will be AI. So not End of Story.
People are morons. Individuals are smart, intelligent, funny, interesting, etc. But in groups we're moronic.
leptons 19 hours ago [-]
So when an LLM all-too-often produces garbage, can we then call it "Artificial Stupidity"?
byteknight 18 hours ago [-]
Not sure how that fits. Do you produce good results every time, first try? Didn't think so.
leptons 18 hours ago [-]
>Do you produce good results every time, first try?
Almost always, yes, because I know what I'm doing and I have a brain that can think. I actually think before I do anything, which leads to good results. Don't assume everyone is a junior.
>Didn't think so.
You don't know me at all.
danielbln 4 hours ago [-]
Here you have it folks, seniors don't make mistakes.
Jensson 14 minutes ago [-]
When I'm confident something will work it almost always works, that is very different from these models.
Sure sometimes I do stuff I am not confident about to learn but then I don't say "here I solved the problem for you" without building confidence around the solution first.
Every competent senior engineer should be like this; if you aren't, then you aren't competent. If you are confident in a solution then it should almost always work, or else you are overconfident and thus not competent. LLMs are confident in solutions that are shit.
neoromantique 4 hours ago [-]
Sr. "human" here.
If you always use your first output then you are not a senior engineer: either your problem space is THAT simple that you can fit all of the context in your head at the same time on the first try, or quite frankly you just bodge things together in a non-optimal way.
It always takes a few tries at a problem to grasp the edge cases and to more easily visualize the problem space.
Jensson 18 minutes ago [-]
Depends on how you define "try". If someone asks me to do something I don't come back with a buggy piece of garbage and say "here, I'm done!", the first deliverable will be a valid one, or I'll say I need more to do it.
imiric 3 hours ago [-]
> The sooner people stop worrying about a label for what you feel fits LLMs best, the sooner they can find the things they (LLMs) absolutely excel at and improve their (the user's) workflows.
This is not a fault of the users. These labels are pushed primarily by "AI" companies in order to hype their products to be far more capable than they are, which in turn increases their financial valuation. Starting with "AI" itself, "superintelligence", "reasoning", "chain of thought", "mixture of experts", and a bunch of other labels that anthropomorphize and aggrandize their products. This is a grifting tactic old as time itself.
From Sam Altman[1]:
> We are past the event horizon; the takeoff has started. Humanity is close to building digital superintelligence
Apologists will say "they're just words that best describe these products", repeat Dijkstra's "submarines don't swim" quote, but all of this is missing the point. These words are used deliberately because of their association to human concepts, when in reality the way the products work is not even close to what those words mean. In fact, the fuzzier the word's definition ("intelligence", "reasoning", "thought"), the more valuable it is, since it makes the product sound mysterious and magical, and makes it easier to shake off critics. This is an absolutely insidious marketing tactic.
The sooner companies start promoting their products honestly, the sooner their products will actually benefit humanity. Until then, we'll keep drowning in disinformation, and reaping the consequences of an unregulated marketplace of grifters.
> Anyone that says LLMs do not qualify as AI are the same people who will continue to move the goal posts for AGI.
I have the complete opposite feeling. The layman understanding of the term "AI" is AGI, a term that only needs to exist because researchers and businessmen hype their latest creations as AI.
The goalposts for AI don't move; the definition isn't precise, but we know it when we see it.
AI, to the layman, is Skynet/Terminator, Asimov's robots, Data, etc.
The goalposts moving that you're seeing is when something the tech bubble calls AI escapes the tech bubble and everyone else looks at it and says, no, that's not AI.
The problem is that everything that comes out of the research efforts toward AI, the tech industry calls AI, despite it not achieving that goal by the common understanding of the term. LLMs were/are a hopeful AI candidate, but as of today they aren't, and that doesn't stop OpenAI from trying to raise money using the term.
shkkmo 4 hours ago [-]
AI has had many, many lay meanings over the years. Simplistic decision trees and heuristics for video games is called AI. It is a loose term and trying to apply it with semantic rigour is useless, as is trying to tell people that it should only be used to match one of its many meanings.
If you want some semantic rigour use more specific terms like AGI, human equivalent AGI, super human AGI, exponentially self improving AGI, etc. Even those labels lack rigour, but at least they are less ambiguous.
LLMs are pretty clearly AI and AGI under commonly understood, lay definitions. LLMs are not human level AGI and perhaps will never be by themselves.
byteknight 18 hours ago [-]
"Just ask AI" is a phrase you will hear around enterprises now. You less often hear "Google it". You hear "ChatGPT it".
skydhash 21 hours ago [-]
When the first cars broke down, people were not saying: One day, we’ll go to the moon with one of these.
LLMs may get better, but they will not be what people are clamoring for them to be.
serf 10 hours ago [-]
>When the first cars broke down, people were not saying: One day, we’ll go to the moon with one of these.
maybe they should have; a lot of the engineering techniques and methodologies that produced the assembly line and the mass-produced vehicle also led the way into space exploration.
windward 4 hours ago [-]
How do you differentiate between tech that's 'first cars' and tech that's 'first passenger supersonic aircraft'?
shalmanese 5 hours ago [-]
The analogy is very apt because the first cars:
* are many times the size of the occupants, greatly constricting throughput.
* are many times heavier than humans, requiring vastly more energy to move.
* travel at speeds and weights that are a danger to humans, thus requiring strictly segregated spaces.
* are only used less than 5% of the day, requiring places to store them when unused.
* require extremely wide turning radii when traveling at speed (there's a viral photo showing the entire historical city of Florence fitting inside a single US cloverleaf interchange)
Not only have none of these flaws been fixed, many of them have gotten worse with advancing technology because they’re baked into the nature of cars.
Anyone at the invention of automobiles with sufficient foresight could have seen the havoc that these intersecting incentives around cars would wreak, same as how many of the future impacts of LLMs are foreseeable today, independent of technical progress.
tobr 21 hours ago [-]
The article has a very nuanced point about why it’s not just a matter of today’s vs tomorrow’s LLMs. What’s lacking is a fundamental capacity to build mental models and learn new things specific to the problem at hand. Maybe this can be fixed in theory with some kind of on-the-fly finetuning, but it’s not just about more context.
ako 9 hours ago [-]
You can give it some documents, or classroom textbooks, and it can turn those into RDF graphs, explaining what the main concepts are and how they are related. This can then be used by an LLM to solve other problems.
It can also learn new things using trial and error with mcp tools. Once it has figured out some problem, you can ask it to summarize the insights for later use.
What would you define as an AI mental model?
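To make that concrete, here is a tiny sketch (Python with rdflib, and an invented example domain) of the kind of graph I mean:

    from rdflib import Graph, Literal, Namespace, RDF, RDFS

    # Invented mini-graph for a textbook chapter; the names are purely illustrative.
    EX = Namespace("http://example.org/course/")
    g = Graph()
    g.add((EX.Authentication, RDF.type, EX.Concept))
    g.add((EX.Authorization, RDF.type, EX.Concept))
    g.add((EX.Authorization, EX.dependsOn, EX.Authentication))
    g.add((EX.Authorization, RDFS.comment,
           Literal("Decides what an authenticated user may do")))

    print(g.serialize(format="turtle"))

The serialized Turtle output is compact enough to drop back into a prompt as reusable "knowledge" about the material.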
tobr 5 hours ago [-]
I’m not an expert on this, so I’m not familiar with what RDF graphs are, but I feel like everything you’re describing happens textually, and used as context? That is, it’s not at all ”learning” the way it’s learning during training, but by writing things down to refer to them later? As you say - ”ask it to summarize the insights for later use” - this is fundamentally different from the types of ”insights” it can have during training. So, it can take notes about your code and refer back to them, but it only has meaningful ”knowledge” about code it came across in training.
To me as a layman, this feels like a clear explanation of how these tools break down, why they start going in circles when you reach a certain complexity, why they make a mess of unusual requirements, and why they have such an incredible nuanced grasp of complex ideas that are widely publicized, while being unable to draw basic conclusions about specific constraints in your project.
dml2135 20 hours ago [-]
This is like saying that because of all the advancements that automobiles have made, teleportation is right around the corner.
brandon272 21 hours ago [-]
The question is, when is “tomorrow”?
Dismissing a concern with “LLMs/AI can’t do it today but they will probably be able to do it tomorrow” isn’t all that useful or helpful when “tomorrow” in this context could just as easily be “two months from now” or “50 years from now”.
card_zero 20 hours ago [-]
When monowheels were first invented, they were very difficult to steer due to the gyroscopic effects inherent to a large wheel model (LWM).
jedimastert 21 hours ago [-]
> The first cars broke down all the time. They had a limited range. There wasn't a vast supply of parts for them. There wasn't a vast industry of experts who could work on them.
I mean, there was and then there wasn't. All of those things are shrinking fast because we handed over control to people who care more about profits than customers because we got too comfy and too cheap, and now right to repair is screwed.
Honestly, I see LLM-driven development as a threat to open source and right to repair, among a litany of other things
ajuc 20 hours ago [-]
It also doesn't mean they can. LLMs may be the steam-powered planes of our times.
A crucial ingredient might be missing.
aaroninsf 21 hours ago [-]
My preferred formulation is Ximm's Law,
"Every critique of AI assumes to some degree that contemporary implementations will not, or cannot, be improved upon.
Lemma: any statement about AI which uses the word "never" to preclude some feature from future realization is false.
Lemma: contemporary implementations have almost always already been improved upon, but are unevenly distributed."
moregrist 21 hours ago [-]
Replace “AI” with “fusion” and you immediately see the problem: there’s no concept of timescale or cost.
And with fusion, we already have a working prototype (the Sun). And if we could just scale our tech up enough, maybe we’d have usable fusion.
dpatterbee 20 hours ago [-]
Heck, replace "AI" with almost any noun and you can close your eyes to any and all criticism!
gjm11 18 hours ago [-]
Only to criticism of the form "X can never ...", and some such criticism richly deserves to be ignored.
(Sometimes that sort of criticism is spot on. If someone says they've got a brilliant new design for a perpetual motion machine, go ahead and tell them it'll never work. But in the general case it's overconfident.)
latexr 20 hours ago [-]
> Every critique of AI assumes to some degree that contemporary implementations will not, or cannot, be improved upon.
That is too reductive and simply not true. Contemporary critiques of AI include that they waste precious resources (such as water and energy) and accelerate bad environmental and societal outcomes (such as climate change, the spread of misinformation, loss of expertise), among others. Critiques go far beyond “hur dur, LLM can’t code good”, and those problems are both serious and urgent. Keep sweeping critiques under the rug because “they’ll be solved in the next five years” (eternally away) and it may be too late. Critiques have to take into account the now and the very real repercussions already happening.
antod 14 hours ago [-]
Agreed. I find LLMs incredibly useful for my work and I'm amazed at what they can do.
But I'm really worried that the benefits are very localized, and that the externalized costs are vast, and the damage and potential damage isn't being addressed. I think that they could be one of the greatest ever drivers of inequality as a privileged few profit at the expense of the many.
Any debates seem to neglect this as they veer off into AGI Skynet fantasy-land damage rather than grounded real-world damage. This seems to be deliberate distraction.
apwell23 20 hours ago [-]
ugh.. no analogies pls
ants_everywhere 20 hours ago [-]
The anti-LLM chorus hates when you bring up the history of technological change
ai-christianson 21 hours ago [-]
I take a more pragmatic approach: everything is human in the loop. It helps me get the job done faster and with higher quality, so I use it.
appease7727 20 hours ago [-]
The way it works for me at least is I can fit a huge amount of context in my head. This works because the text is utterly irrelevant and gets discarded immediately.
Instead, my brain parses code into something like an AST which then is represented as a spatial graph. I model the program as a logical structure instead of a textual one. When you look past the language, you can work on the program. The two are utterly disjoint.
I think LLMs fail at software because they're focused on text and can't build a mental model of the program logic. It takes a huge amount of effort and brainpower to truly architect something and understand large swathes of the system. LLMs just don't have that type of abstract reasoning.
starlust2 20 hours ago [-]
It's not that they can't build a mental model, it's that they don't attempt to build one. LLMs jump straight from text to code with little to no time spent trying to architect the system.
taminka 20 hours ago [-]
i wonder why nobody bothered w/ feeding llms the ast instead (not sure in what format), but it only seems logical, since that's how compilers understand code after all...
NitpickLawyer 19 hours ago [-]
There are various efforts on this, from many teams. There's AST dump, AST-based graphs, GraphRAG w/ AST grounding, embeddings based AST trimming, search based AST trimming, ctags, and so on. We're still in the exploration space, and "best practices" are still being discovered.
It's funny that everyone says "LLMs" have plateaued, yet the base models have caught up with early attempts to build harnesses using the things I've mentioned above. They now match or exceed the previous generation's software glue with just "tools", even limited ones like just "terminal".
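For a flavour of the simplest variant, here is a minimal sketch of an AST-derived summary (Python stdlib only; the example functions are invented):

    import ast

    # Hypothetical source an agent wants context about; only the outline matters.
    source = '''
    def verify_token(token: str) -> bool:
        """Check the token signature and expiry."""
        ...

    def require_role(user, role):
        """Raise if the user lacks the given role."""
        ...
    '''

    # Compact "AST dump": function names, parameters and docstrings only,
    # which is far cheaper to put in a model's context than the raw file.
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            params = ", ".join(a.arg for a in node.args.args)
            print(f"{node.name}({params}): {ast.get_docstring(node) or ''}")

The real systems above go much further (graphs, embeddings, retrieval), but trimming code down to structure before it hits the context window is the common thread.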
jeffreygoesto 6 hours ago [-]
Experience adds both additional layers vertically and domain knowledge horizontally and at some point that creates non-linear benefits, because you can transfer between problems and more importantly solutions of different fields. The context window is only one layer.
An AI might tell you to use a 403 for insufficient privileges instead of 401.
dade_ 19 hours ago [-]
You are absolutely right, let me fix that.
fransje26 1 hours ago [-]
What a great suggestion.
JackFr 20 hours ago [-]
- When we have a report of a failing test, before fixing it, identify the component under test. Think deeply about the component and describe its purpose, the control flows and state changes that occur within the component, and the assumptions the component makes about context. Write that analysis in a file called component-name-mental-model.md.
- Whenever you address a failing test, always bring your component mental model into the context.
Paste that into your Claude prompt and see if you get better results. You'll even be able to read and correct the LLM's mental model.
fmbb 1 hours ago [-]
Anthropic sells this thing called Claude Code, but their customers have to train it to know how to be a programmer?
Junior developers not even out of school don’t need to be instructed to think.
siddboots 10 hours ago [-]
In my experience, complicated rules like this are extremely unreliable. Claude just ignores it much of the time. The problem is that when Claude sees a failing test it is usually just an obstacle to completing some other task at hand - it essentially never chooses to branch out into some new complicated workflow and instead will find some other low friction solution. This is exactly why subagents are effective: if Claude knows to always run tests via a testing subagent, then the specific testing workflow can become that subagent’s whole objective.
Natsu 18 hours ago [-]
> the response for that function should maybe differentiate between "401 because you didn't authenticate" and "401 because your privileges are too low".
I'd tend to think it more proper if it were 401 you didn't authenticate and 403 you're forbidden from doing that with those user rights, but you have to be careful about exactly how detailed your messages are, lest they get tagged as a CWE-209 in your next security audit.
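A minimal, framework-free sketch of that split (all names here are invented), just to make the two cases concrete:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class User:
        name: str
        roles: tuple[str, ...]

    # Stand-in for real token verification; purely illustrative.
    _KNOWN = {"alice-token": User("alice", ("admin",)),
              "bob-token": User("bob", ("viewer",))}

    def authorize(token: Optional[str], required_role: str) -> tuple[int, str]:
        user = _KNOWN.get(token or "")
        if user is None:
            # No or invalid credentials -> 401: "you didn't authenticate".
            return 401, "authentication required"
        if required_role not in user.roles:
            # Authenticated but lacking privileges -> 403: "you may not do this".
            # Keep the body generic so it doesn't leak internals (the CWE-209 worry).
            return 403, "forbidden"
        return 200, "ok"

    print(authorize(None, "admin"))          # (401, 'authentication required')
    print(authorize("bob-token", "admin"))   # (403, 'forbidden')
    print(authorize("alice-token", "admin")) # (200, 'ok')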
exe34 19 hours ago [-]
to be fair, I've seen cursor step back and check higher level things. I was trying to set up a firecracker vm and it did everything for me, and when things didn't initially work, it started doing things like ls, tar -tvf, and then a bunch of checking networking stuff to make sure things were showing up in the right place.
so current LLMs might not quite be human level, but I'd have to see a bigger model fail before I'd conclude that it can't do $X.
trod1234 21 hours ago [-]
Isn't the 401 for LLMs the same single undecidable token?
Doesn't this basically go to the undecidable nature of math in CS?
Put another way: you have an Excel roster of people with accounts, where some need to have their account shut down, but you only have their first and last names as identifiers, and the pool is sufficiently large that there is more than one person per given set of names.
You can't shut down all accounts with a given name, and there is no unique identifier. How do you solve this?
You have to ask and be given that unique identifier that differentiates between the undecidable. Without that, even the person can't do the task.
The person can make guesses, but those guesses are just hallucinations with a significant probability of a bad outcome on each repeat.
At a core level I don't think these types of issues are going to be solved. Quite a lot of people would be unable to solve this and would struggle with this example (when not given the answer, or hinted at the solution in the framing of the task; i.e. when they just have a list of names and are told to do an impossible task).
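A toy sketch of that undecidability (data and names invented):

    # Toy roster: first/last name only, no unique identifier.
    roster = [
        {"first": "Ana", "last": "Diaz", "account": "acct-001"},
        {"first": "Ana", "last": "Diaz", "account": "acct-002"},  # different person, same name
        {"first": "Raj", "last": "Patel", "account": "acct-003"},
    ]

    def accounts_to_shut_down(first: str, last: str) -> list[str]:
        matches = [r["account"] for r in roster
                   if r["first"] == first and r["last"] == last]
        if len(matches) > 1:
            # Undecidable with the data at hand: any pick is a guess, i.e. a
            # confident-sounding hallucination. Ask for a unique identifier instead.
            raise ValueError(f"{len(matches)} people named {first} {last}; need a unique ID")
        return matches

    print(accounts_to_shut_down("Raj", "Patel"))  # ['acct-003']
    print(accounts_to_shut_down("Ana", "Diaz"))   # raises ValueError: need a unique ID

No amount of cleverness recovers information that simply isn't in the list.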
fragmede 20 hours ago [-]
If you can't get the LLM to generate code that handles an error code, that's on you. Yeah, sometimes it does dumb shit. Who cares? Just /undo and retry. Stop using Claude Code, which uses git like an intern. (Which is to say, it doesn't unless forced to.)
theonething 7 hours ago [-]
> Oh, I am getting an authentication error. Well, meaybe I should just delete the token check for that code path...problem solved?!
Kind of hyperbolic. If you prompt well, it generally won't do anything stupid to that extreme.
reactordev 21 hours ago [-]
While I agree with you - The whole grug brain thing is offensive. Because we have all been grug at some point.
lioeters 19 hours ago [-]
Grug is the wise fool in the spirit of Lao Tzu, St. Francis, and Diogenes. If you find it offensive, that's the intellectual pride it's meant to make fun of.
reactordev 17 hours ago [-]
The principles are sound but I dislike the cave-man-esque nature of it. Even a wise fool is smarter than that. Language is foundational. Even a wise fool chooses words wisely.
”Wise men speak because they have something to say; Fools speak because they have to say something” -Plato
sarchertech 15 hours ago [-]
That’s the bit. It’s a joke.
recursive 20 hours ago [-]
How does that make it offensive? To me, that makes it relatable.
meindnoch 20 hours ago [-]
Midwit take.
Grug is both the high and low end of the Bell curve.
WhyOhWhyQ 20 hours ago [-]
This seems to miss the point. Being Grug is the endgame.
lcnPylGDnU4H9OF 19 hours ago [-]
> big brained developers are many, and some not expected to like this, make sour face
> THINK they are big brained developers many, many more, and more even definitely probably maybe not like this, many sour face (such is internet)
> (note: grug once think big brained but learn hard way)
reactordev 19 hours ago [-]
It just reads like they had a stroke and can no longer function.
recursive 16 hours ago [-]
I guess it's not for everyone. It makes sense to me. shrug
throwaway1004 21 hours ago [-]
That reference link is a wild ride of unqualified, cartoonish passive-aggression, the cute link to the author's "swag" is the icing on the cake.
Coincidentally, I encountered the author's work for the first time only a couple of days ago as a podcast guest; he vouches for the "Dirty Code" approach while straw-manning Uncle Bob's general principles of balancing terseness/efficiency with ergonomics and readability (in most, but not all, cases).
I guess this stuff sells t-shirts and mugs /rant
Arainach 21 hours ago [-]
>Uncle Bob's general principles of balancing terseness/efficiency with ergonomics and readability (in most, but not all, cases).
Have you read Uncle Bob? There's no need to strawman: Bob's examples in Clean Code are absolutely nuts.
Here's a nice writeup that includes one of Bob's examples verbatim in case you've forgotten: https://qntm.org/clean
Yes, I have read Uncle Bob. I could agree that the examples in the book leave room for improvement.
Meanwhile, the real-world application of these principles and trial-and-error, collectively within my industry, yields a more accurate picture of its usefulness.
Even the most click-bait'y criticisms (such as the author I referenced above) involve zooming in on its most controversial aspects, in a vacuum, without addressing the core principles and how they're completely necessary for delivering software at scale, warranting its status as a seminal work.
"...for the obedience of fools, and the guidance of wise men", indeed!
edit - it's the same arc as Agile has endured:
1. a good-faith argument for a better way of doing things is recognised and popularised.
2. It's abused and misused by bad actors/incompetents for years (who would not have done better using a different process)
3. Jaded/opportunistic talking heads tell us it's all garbage while simultaneously explaining that "well, it would be great if it wasn't applied poorly..."
Arainach 17 hours ago [-]
>involve zooming in on its most controversial aspects, in a vacuum, without addressing the core principles and how they're completely necessary for delivering software at scale, warranting its status as a seminal work.
It's not "zooming in" to point out that the first and second rules in Bob's work are "functions should be absurdly tiny, 4 lines or less" and that in the real world that results in unreadable garbage. This isn't digging through and looking for edge cases - all of the rules are fundamentally flawed.
Sure, if you summarize the whole book as "keep things small with a single purpose" that's not an awful message, but that's not the book. Other books have put that point better without all of the problems. The book is full of detailed specific instructions, and almost all of the specifics are garbage that does more harm than good in the real world.
Clean Code has no nuance, only dogma, and that's a big problem (a point the second article I linked calls out and discusses in depth). There are some good practices in it, but basically all of its code is a mistake that is harmful to a new engineer to read.
throwaway1004 17 hours ago [-]
>Sure, if you summarize the whole book as "keep things small with a single purpose" that's not an awful message, but that's not the book.
Assuming that you have read the book, I find it odd that you would consider that to be the steel-man a fan of this work would invent; it covers considerably more ground than that:
- Prioritise human-readability
- Use meaningful names
- Consistent formatting
- Quality comments
- Be DRY, stop copy-pasting
- Test
- SOLID
All aspects of programming, to this day, I routinely see done lazily and poorly. This rarely correlates with experience, and usually with aptitude.
>Clean Code has no nuance, only dogma, and that's a big problem (a point the second article I linked calls out and discusses in depth)
It's opinionated and takes its line of reasoning to the Nth degree. We can all agree that the application of the rules requires nuance and intelligence. The second article you linked is a lot more forgiving and pragmatic than your characterisation of the issue.
I would expect the entire industry to do a better job of picking apart and contextualising the work, after it made an impact on the industry, than the author himself could or ever will be capable of.
My main problem is the inanity of reactionary criticism which doesn't engage with the ideas. Is Clean Code responsible for a net negative effect on our profession, directly or indirectly? Are we correlating a negative trend in ability with the influence of this work? What exactly are the "Dirty Code" mug salesmen proposing as an alternative; what are they even proposing as the problem, other than that the examples in CC are bad and it's easy to misapply its principles?
Arainach 15 hours ago [-]
>We can all agree that the application of the rules require nuance and intelligence
Except Uncle Bob, it seems, as evidenced by his code samples and his presentations in the years since that book came out. That's my objection. Many others have presented Bob's ideas better in the last 19 years. The book was good at the time, but we're a decade past when we should have stopped recommending it. Have folks go read Ousterhout instead - shorter, better, more durable.
the__alchemist 20 hours ago [-]
Uncle Bob's rules: IMO do the opposite of what they say. They're a reasonable set if negated!
pphysch 21 hours ago [-]
> big brained developers are many, and some not expected to like this, make sour face
ruslan_sure 5 hours ago [-]
I don't think it's helpful to put words in the LLM's mouth.
To properly think about that, we need to describe how an LLM thinks.
It doesn't think in words or move vague, unwieldy concepts around and then translate them into words, like humans do. It works with words (tokens) and their probability of appearing next. The main thing is that these probabilities represent the "thinking" that was initially behind the sentences with such words in its training set, so it manipulates words with the meaning behind them.
Now, to your points:
1) Regarding adding more words to the context window, it's not about "more"; it's about "enough." If you don't have enough context for your task, how will you accomplish it? "Go there, I don't know where."
2) Regarding "problem solved," if the LLM suggests or does such a thing, it only means that, given the current context, this is how the average developer would solve the issue. So it's not an intelligence issue; it's a context and training set issue! When you write that "software engineers can step back, think about the whole thing, and determine the root cause of a problem," notice that you're actually referring to context. If you don't have enough context or a tool to add data, no developer (digital or analog) will be able to complete the task.
adastra22 5 hours ago [-]
> It doesn't think in words or move vague, unwieldy concepts around and then translate them into words, like humans do.
That seems to me like a perfectly fine description of state space & chain of thought continuation.
andrewmutz 20 hours ago [-]
The author does not understand what LLMs and coding tools are capable of today.
> LLMs get endlessly confused: they assume the code they wrote actually works; when test fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over. This is exactly the opposite of what I am looking for. Software engineers test their work as they go. When tests fail, they can check in with their mental model to decide whether to fix the code or the tests, or just to gather more data before making a decision. When they get frustrated, they can reach for help by talking things through. And although sometimes they do delete it all and start over, they do so with a clearer understanding of the problem.
My experience is based on using Cline with Anthropic Sonnet 3.7 doing TDD on Rails, and it has been very different. I instruct the model to write tests before any code and it does. It works in small enough chunks that I can review each one. When tests fail, it tends to reason very well about why and fixes the appropriate place. It is very common for the LLM to consult more code as it goes to learn more.
It's certainly not perfect but it works about as well, if not better, than a human junior engineer. Sometimes it can't solve a bug, but human junior engineers get in the same situation too.
itsalotoffun 10 minutes ago [-]
> It works in small ... chunks
Yup.
> I ... review each one
Yup.
These two practices are core to your success. GenAI reliably hangs itself given longer rope.
YuukiRey 7 hours ago [-]
I share examples of LLM fails on our company Slack and every week LLMs do the opposite of what I tell them.
I say capture logs without overriding console methods -> they override console methods.
YOU ARE NOT ALLOWED TO CHANGE THE TESTS -> test changed
Or they insert various sleep calls into a test to work around race conditions.
This is all from Claude Sonnet 4.
carb 5 hours ago [-]
I've found better results when I treat LLMs like you would treat little kids. Don't tell them what NOT to do, tell them what TO do.
Say "keep your hands at your side, it's hot" and not "don't touch the stove, it's hot". If you say the latter, most kids touch the stove.
alpaca128 27 minutes ago [-]
If LLMs cannot reliably deal with this, how can they write reliable code? Following an instruction like "don't do X" is more basic than the logic of fizzbuzz.
This reminds me of the query "shirt without stripes" on any online image/product search.
glitchcrab 1 hours ago [-]
My eureka moment when I first started using Cursor a few weeks back was realising that I was talking to it the same way I talk to my three-year-old, and the results were fairly good (less so from my boy at times).
IshKebab 52 minutes ago [-]
Yeah it's also kind of funny people discovering all the LLM failure modes and saying "see! humans would never do that! it's not really intelligent!". None of those people have children...
sothatsit 5 hours ago [-]
I have also had this happen, but only when my context is getting too long, at which point models stop reading my instructions. Or if there have been too many back and forths, this can happen as well.
There is a steady decline in models' capabilities across the board as their contexts get longer. Wiping the slate clean regularly really helps to counteract this, but it can become a real pain to rebuild the context from scratch over and over. Unfortunately, I don't really know any other way to avoid the models getting really dumb over time.
maelito 5 hours ago [-]
LLMs erasing your important comments is so irritating! Happened to me often.
iamflimflam1 5 hours ago [-]
Do you also share examples of when it works really well?
kubb 20 hours ago [-]
I believe that they work particularly well for CRUD in known frameworks like Rails.
OTOH I tried building a native Windows Application using Direct2D in Rust and it was a disaster.
I wish people could be a bit more open about what they build.
wg0 3 hours ago [-]
The author isn't wrong that LLMs don't work like an engineer and often fail miserably.
Here's what works however:
Mostly CRUD apps or REST APIs in Rails, Django, or other microframeworks such as FastAPI, etc.
Or with React.
In that too, focus on small components and small steps or else you'll fail to get the results.
andrewmutz 18 hours ago [-]
I agree that it is probably easier for an LLM to write good code in any framework (like Rails) that has a lot of well-documented opinions about how things should be done. If there is a "right" place to put things, or a "right" way to model problems in a framework, it's more likely that the model's opinions are going to line up with the human engineer's opinions.
alkonaut 6 hours ago [-]
Also - that's easy for everyone. It's basically a framework so rigid/simple (those are adjacent concepts for frameworks) that the business logic is almost boilerplate.
That is, so long as you stay inside the guard rails. Ask it to make something in a rails app that's slightly beyond the CRUD scope and it will suffer - much like most humans would.
So it's not that it's bad to let bots do boilerplate. But using very qualified humans for that was a waste to begin with.
Hopefully in a few years none of us will need to do ANY part of CRUD work and we can do only the fun parts of software development.
Aeolun 18 hours ago [-]
I thought Claude got significantly smarter when I started using Rust. The big problem there is that I don't understand the Rust myself :P
klabb3 5 hours ago [-]
It’s the style. Responses are always eloquent and well structured. When you look at output for a domain you don’t know well, you give it benefit of the doubt because it sounds like a highly competent human, so you react similarly. When you use it with something you know very deeply, you naturally look more for substance rather than form, and thus spot the mistakes much easier. This breaks most illusions of amazing reasoning abilities etc.
My ChatGPT is amazingly competent at gardening! Well, that’s how it feels anyway. Is it correct? I have no idea. It sounds right. Fortunately, it’s just a new hobby for me and the stakes are low. But generally I think it’s much better to be paranoid than gullible when it comes to confident sounding ramblings, whether it’s from an LLM or a marketing guru.
sdesol 19 hours ago [-]
> I wish people could be a bit more open about what they build.
I would say for the last 6 months, 95% of the code for my chat app (https://github.com/gitsense/chat) was AI generated (98% human architected). I believe what I created in the last 6 months was far from trivial. One of the features that AI helped a lot with, was the AI Search Assistant feature. You can learn more about it here https://github.com/gitsense/chat/blob/main/packages/chat/wid...
As a debugging partner, LLMs are invaluable. I could easily load all the backend search code into context and have it trace a query and create a context bundle with just the affected files. Once I had that, I would use my tool to filter the context to just those files and then chat with the LLM to figure out what went wrong or why the search was slow.
I very much agree with the author of the blog post about why LLMs can't really build software. AI is an industry game changer as it can truly 3x to 4x senior developers in my opinion. I should also note that I spend about $2 a day on LLM API calls (99% to Gemini 2.5 Flash) and I probably have to read 200+ LLM generated messages a day and reply back in great detail about 5 times a day (think of an email instead of chat message).
Note: The demo that I have in the README hasn't been set up yet, as I am still in the process of finalizing things for release, but the NPM install instructions should work.
QuadmasterXLII 18 hours ago [-]
What happens when you tell the AI to set up the demo in the README?
sdesol 15 hours ago [-]
It summarized the instructions required to install and set up. It (Gemini and Sonnet) did fail to mention that I need to set up a server and create a DNS entry for the subdomain.
leptons 17 hours ago [-]
> probably have to read 200+ LLM generated messages a day and reply back in great detail about 5 times a day (think of an email instead of chat message).
I can think of nothing more tiresome than having to read 200 emails a day, or LLM chat messages. And then respond in detail 5 of those times. It wouldn't lead to "3x to 4x" performance gain after tallying up all the time reading messages and replying. I'm not sure people that use LLMs this way are really tracking their time enough to say with any confidence that "3x to 4x" is anywhere close to reality.
sdesol 15 hours ago [-]
A lot of the messages are revisions so it is not as tedious as it may seem. As for the "3x to 4x", this is my own experience. It is possible that I am an outlier, but 80% of the AI-generated code that I have is one-shot. I spend an hour or two (usually spread over days thinking about the problem) to accomplish something that would have taken a week or more for me to do.
I'm going to start producing metrics regarding how much code is AI generated along with some complexity metrics.
I am obviously biased, but this definitely feels like a paradigm shift and if people do not fully learn to adapt to it, it might be too late. I am not sure if you have ever watched Gattaca, but this sort of feels like it...the astronaut part, that is.
The profession that I have known for decades is starting to feel very different, in the same way that while watching Gattaca, my perception of astronauts changed. It was strange, but plausible and that is what I see for the software industry. Those that can articulate the problem I believe will become more valuable than the silent genius.
normie3000 10 hours ago [-]
> if people do not fully learn to adapt to it, it might be too late
Why would it ever be too late?
sdesol 9 hours ago [-]
Age discrimination, saturated market, no longer a team fit (everybody is using AI and they have metrics to backup performance gains), etc.
normie3000 7 hours ago [-]
Can't someone who doesn't use it just..start using it?
sdesol 7 hours ago [-]
Sure it can become a hobby.
leptons 14 hours ago [-]
The same noise was made about pair programming and it hasn't really caught on. Using LLMs to write code is one way of getting code written, but it isn't necessarily the best, and it seems kind of fad-ish honestly. Yes, I use "AI" in my coding workflow, but it's overall more annoying than it is helpful. If you're naturally 3x-4x slower than I am, then congratulations, you're now getting up to speed. It's all pretty subjective I think.
sdesol 14 hours ago [-]
> It's all pretty subjective I think.
This is very measurable, as you are not measuring against others, but yourself. The baseline is you, so it is very easy to determine if you become more productive or not. What you are saying is, you do not believe "you" can leverage AI to be more efficient than you currently are, which may well be true due to your domain and expertise.
leptons 11 hours ago [-]
No matter what "AI" can or can't do for me, it's being forced on us all anyway, which kind of sucks. Every time I select something the AI wrote it's collecting a statistic and I'm sure someone is probably monitoring how much we use the "AI" and that could become a metric for job performance, even if it doesn't really raise quality or amplify my output very much.
sdesol 9 hours ago [-]
> being forced on us all anyway, which kind of sucks
Business is business, and if you can demonstrate that you are needed they will keep you, for the most part, but business also has politics.
> probably monitoring how much we use the "AI" and that could become a metric for job performance
I will bet on this and take it one step further. They (employer) are going to want to start tracking LLM conversations. If everybody is using AI, they (employer) will need differentiators to justify pay raises, promotions and so forth.
leptons 9 hours ago [-]
>> how much we use the "AI" and that could become a metric for job performance
> they (employer) will need differentiators to justify pay raises, promotions and so forth.
That is exactly what I meant.
stingraycharles 10 hours ago [-]
I recently built a data streaming connector in Go with all kinds of bells and whistles attached (yaml based data parsers, circuit breakers, e2e stress testing frameworks, etc). Worked like a charm, I estimate it made two months of work about two weeks.
But you need to get your workflow right.
quantumHazer 20 hours ago [-]
yeah, typically they are building a to-do list and organizer app and have not found that github is flooded with college students' projects of their revolutionary to-do apps
kubb 19 hours ago [-]
I don’t want to dismiss or disrespect anyone’s work. But I never see precise descriptions of categories of tasks that work well, it’s all based on vibes.
littlestymaar 4 hours ago [-]
> The author does not understand what LLMs and coding tools are capable of today.
Claiming that the people making an AI coding tool (Zed) don't know LLM coding tools is both preposterous and extremely arrogant.
geraneum 2 hours ago [-]
Oh well… you should see what some people comment under the posts from the likes of Yann LeCun. It’s very entertaining.
llmsRstubborn 4 hours ago [-]
Agreed. This passage in particular
> when test fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over
Is the EXACT OPPOSITE of what LLMs tend to do. They are very stubborn in their approach and will keep at it, often until you roll back to a previous prompt. Them deleting code tends to happen on command, except specifically if I do TDD, which may as well be a preemptive command to do so.
quantumHazer 20 hours ago [-]
it's very well documented behavior that models try to pass failing tests with hacks and tricks (hard-coding solutions and so on)
greymalik 20 hours ago [-]
It is also true that you can instruct them not to do that, with success.
quantumHazer 20 hours ago [-]
It is also true that models don't give a ** about instructions sometimes and do whatever text prediction is more likely (even with reasoning)
swat535 18 hours ago [-]
Another issue is that LLMs have no ability to learn anything.
Even if you supply them with the file content, they are not able to recall it, or if they do, they will quickly forget.
For example, if you tell them that the "Invoice" model has fields x, y, z and supply part of the schema.
A few responses later, it will give you an Invoice model that has a, b, c, because those are the most common ones.
Adding to this, you have them writing tautology tests, removing requirements to fix the bugs and hallucinating new requirements and you end up with catastrophic consequences.
jarjoura 20 hours ago [-]
My experience so far is that, if you're limiting the "capacity" to junior engineer, yes, especially when it's seen a problem before. It's able to quickly realize a solution and confirm that the solution works.
It does not work so well for any problem it has not seen before. At that point you need to explain the problem and instruct the solution. So at that point, you're just acting as a mentor instead of using your capacity to just implement the solution yourself.
My whole team has really bought into the "claude-code" way of doing side tasks that have been on the backlog for years, think like simple refactors, or secondary analytic systems. Basically any well-trodden path that is mostly constrained by time that none of us are given, are perfect for these agents right now.
Personally I'm enjoying the ability to highlight a section of code and ask the LLM to explain it to me like I'm 5, or look for any potential race conditions. For those archaic, fragile monolithic blocks of code that stick around long after the original engineers have left, it's magical to use the LLM to wrap my head around them.
I haven't found it can write these things any better though, and that is the key here. It's not very good at creating new things that aren't commonly seen. It also has a code style that is quite different from what already exists. So when it does inject code, oftentimes it has to be rewritten to fit the style around it. Already, I'm hearing whispers of people saying things like "code written for the AI to read." That's where my eyes roll, because the payoff for the extra mental bandwidth doesn't seem worth it right now.
bunderbunder 20 hours ago [-]
From what I've experienced, this depends very much on the programming language, platform, and business domain.
I haven't tried it with Rails myself (haven't touched Ruby in years, to be honest), but it doesn't surprise me that it would work well there. Ruby on Rails programming culture is remarkably consistent about how to do things. I would guess that means that the LLM is able to derive a somewhat (for lack of a better word) saner model from its training data.
By contrast, what it does with Python can get pretty messy pretty quickly. One of the biggest problems I've had with it is that it tends to use a random hodgepodge of different Python coding idioms. That makes TDD particularly challenging because you'll get tests that are well designed for code that's engineered to follow one pattern of changes, written against a SUT that follows conventions that lead to a completely different pattern of changes. The result is horribly brittle tests that repeatedly break for spurious reasons.
And then iterating on it gets pretty wild, too. My favorite behavior is when the real defect is "oops I forgot to sort the results of the query" and the suggested solution is "rip out SqlAlchemy and replace it with Django."
R code is even worse; even getting it to produce code that follows a spec in the first place can be a challenge.
serf 10 hours ago [-]
in my experience TDD is a very powerful paradigm for use with LLMs.
it does a good enough job of wrangling behavior via implied context of the test-space that it seems to really reduce the amount of explanation needed and surprise garbage output.
xmorse 2 hours ago [-]
it's probably because the author uses the useless implementation of the Zed agent
alfalfasprout 18 hours ago [-]
It's funny always seeing comments like this. I call them "skill issue" comments.
The reality is the author very much understands what's available today. Zed, after all, is building out a lot of AI-focused features in its editor and that includes leveraging SOTA LLMs.
> It's certainly not perfect but it works about as well, if not better, than a human junior engineer. Sometimes it can't solve a bug, but human junior engineers get in the same situation too.
I wonder if comments like this are more of a reflection on how bad the hiring pool was even a few years ago than a reflection of how capable LLMs are. I would be distraught if I hired a junior eng with less wherewithal and capabilities than Sonnet 3.7.
materiallie 7 hours ago [-]
This is a very friendly and cordial response. Given that the parent comment was implying that the creators of Zed don't actually know how to build software. Based on their credentials building Rails crud apps, I suppose.
lowsong 10 hours ago [-]
> it works about as well, if not better, than a human junior engineer.
I see this line of reasoning a lot from AI advocates and honestly it's depressing. Do you see less experienced engineers as nothing more than outputters of code? Isn't the entire point of being "junior" at something that you can learn and grow, which these LLM tools cannot?
zamadatix 7 hours ago [-]
They're just comparing levels of work output, but you're the one assuming that must mean a junior has no other value worth engaging with.
kordlessagain 10 hours ago [-]
That's not a line of reasoning. It's an opinion, and they matter. You don't get to make opinions go away just because you don't like them and want to conflate problem sets.
lowsong 10 hours ago [-]
I'm not disputing that they believe that these models are "as good as a junior engineer", by whatever metric you want to measure that on. My point is the very fact that someone uses that as an argument in support LLMs is... profoundly sad.
raincole 4 hours ago [-]
> The author does not understand what LLMs and coding tools are capable of today.
Uh...
This author is developing "LLMs and coding tools of today." It's not like they're just making a typical CRUD Rails app.
chollida1 21 hours ago [-]
Most of this might be true for LLMs, but years of investing experience has created a mental model of looking for the tech or company that sucks and yet keeps growing.
People complained endlessly about the internet in the early to mid 90s: it's slow, it's static, most sites had under-construction signs on them, your phone modem would just randomly disconnect. The internet did suck in a lot of ways, and yet people kept using it.
Twitter sucked in the mid 2000s, we saw the fail whale weekly and yet people continued to use it for breaking news.
Electric cars sucked, no charging, low distance, expensive and yet no matter how much people complain about them they kept getting better.
Phones sucked, pre 3G was slow, there wasn't much you could use them for before app stores and the cameras were potato quality and yet people kept using them while they improved.
Always look for the technology that sucks and yet people keep using it because it provides value. LLMs aren't great at a lot of tasks, and yet no matter how much people complain about them, they keep getting used and keep improving through constant iteration.
LLMs may not be able to build software today, but they are 10x better than where they were in 2022 when we first started using ChatGPT. It's pretty reasonable to assume in 5 years they will be able to do these types of development tasks.
freehorse 21 hours ago [-]
At the same time, there have been expectations about many of these that did not meet reality at any point. Much of this is due to physical limitations that are not trivial to overcome. The internet gets faster and more stable, but the metaverse taking over did not happen, partially because many people still get nausea after a bit and no 10x scaling fixed that.
A lot of what you described as "sucking" was not seen as sucking at the time. Nobody complained about phones being slow because nobody expected to use phones the way we do today. The internet was slow and less stable, but nobody complained that they couldn't stream 4K movies, because nobody expected to. This is anachronistic.
The fact that we can see how some things improved in this or that manner does not mean that LLMs will improve the way you think they will. Maybe we invent a different technology that does a better job. After all, it was not that dial-up itself became faster, and I don't think there were fanatics saying that dial-up technology would give us 1 Gbps speeds. The problem with AI is that because scaling up compute has provided breakthroughs, some think that somehow, with scaling up compute and some technical tricks, we can solve all the current problems. I don't think anybody can say that we cannot invent a technology that overcomes these limits, but whether LLMs are that technology, one that can just keep scaling, has been in doubt. In the last year or so there has been a lot of refinement and broadening of applications, but nothing like a breakthrough.
andreasmetsala 20 hours ago [-]
> but the metaverse taking over did not happen partially because many people still get nausea after a bit and no 10x scaling fixed that.
Has VR really improved 10x? I lost touch after the HTC Vive and heard about Valve Index but I was under the impression that even the best that Apple has on offer is 2x at most.
jdiff 20 hours ago [-]
I think you're reading a little far into it, the number 10x was used prior so it was used there in demonstrating that there are some problems that scaling can't fix, it's not a statement on how far VR has come or not.
runako 21 hours ago [-]
> Phones sucked, pre 3G was slow, there wasn't much you could use them for before app stores and the cameras were potato quality
This is a big rewrite of history. Phones took off because before mobile phones the only way to reach a person was to call when they were at home or their office. People were unreachable for timespans that now seem quaint. Texting brought this into async. The "potato" cameras were the advent of people always having a camera with them.
People using the Nokia 3210 were very much not anticipating when their phones would get good, they were already a killer app. That they improved was icing on the cake.
ARandumGuy 20 hours ago [-]
> People using the Nokia 3210 were very much not anticipating when their phones would get good, they were already a killer app. That they improved was icing on the cake.
It always bugs me whenever I hear someone defend some new tech (blockchain, LLMs, NFTs) by comparing it with phones or the internet or whatever. People did not need to be convinced to use cell phones or the internet. While there were absolutely some naysayers, the utility and usefulness of these technologies was very obvious by the time they became available to consumers.
But also, there's survivorship bias at play here. There are countless promising technologies that never saw widespread adoption. And any given new technology is far more likely to end up as a failure than it is to become "the next iPhone" or "the new internet."
In short, you should sell your technology based on what it can do right now, instead of what it might do in the future. If your tech doesn't provide utility right now, then it should be developed for longer before you start charging money for it. And while there's certainly some use for LLMs, a lot of the current use cases being pushed (google "AI overviews", shitty AI art, AIs writing out emails) aren't particularly useful.
fragmede 20 hours ago [-]
The technology to look at is shopping carts. They're obvious to us now, but when they were first introduced, stores hired actors to use them so that real customers would adopt the habit. There are various "killer" apps that are already currently very useful for their users, but they'll take a while to percolate out as people discover them. That you don't agree with what the corpos are pushing is their bad.
ARandumGuy 19 hours ago [-]
But that's just more cherry-picking. You can always find some past success to push whatever point you're trying to make. But just because shopping carts were a huge hit doesn't mean that whatever you're trying to push will be.
For example, it would be wrong for me to say that "hyperloop got a ton of hype and investments, and it failed. Therefore LLMs, which are also getting a ton of hype and investments, will also fail." Hyperloop and LLMs are fundamentally different technologies, and the failure of hyperloop is a poor indicator of whether LLMs will ultimately succeed.
Which isn't to say we can't make comparisons to previous successes or failures. But those comparisons shouldn't be your main argument for the viability of a new technology.
normie3000 10 hours ago [-]
> But just because shopping carts were a huge hit doesn't mean that whatever you're trying to push will be.
It may have helped that shopping carts were actively designed to be pushed.
fragmede 19 hours ago [-]
Unfortunately my time machine is in the shop, so I don't know what the future looks like, so looking for comparisons is just my way of looking into the future.
My main argument for the viability of the technology is that it's useful today. Even if it doesn't improve from here, my job as a coder has already been changed.
bluefirebrand 8 hours ago [-]
> Even if it doesn't improve from here, my job as a coder has already been changed.
This is so annoying to me. My job as a coder hasn't changed because my responsibilities as a coder haven't changed
Whether or not I beg an LLM to write code for me or write it myself the job is the same. At best there's a new tool to use but the job hasn't changed.
fragmede 7 hours ago [-]
The responsibilities haven't changed, but the amount of time I have to spend reading documentation to regurgitate something that matches the docs in just the right way has plummeted. That wasn't the whole job, no, but that was a component of my job and to pretend otherwise would be dishonest of me. I don't know you so I don't know how much of your job was that aspect. I will be transparent and say that it did add up over a month though. Says more about me and my job than anything else though, I suppose.
komali2 8 hours ago [-]
People used to fill their bags with produce, bundles or bags of fish and meat, and here and there a couple bags or boxes of dry goods.
Carts were a necessity to get people to interact with the new "center aisles" of the grocery store which is mostly full of boxed and canned garbage.
sidewndr46 19 hours ago [-]
As others have mentioned you are just writing your own history to suit your narrative. There is no evidence to support "People complained endlessly about the internet in the early to mid 90s,".
In the early and mid 1990s, people effectively did not use the internet. Usage was minuscule, limited to tiny niche groups. People heard about the internet via the 90-second blurb on the evening news show. It wasn't until sometime after the launch of Facebook that the internet was even mainstream. So I really don't think people complained about the slowness of an internet they weren't using.
I can go on here, but I don't really need to spend paragraphs refuting something that is obviously false.
gyomu 14 minutes ago [-]
> you are just writing your own history to suit your narrative
Classic LLM behavior
chollida1 52 minutes ago [-]
> I can go on here, but I don't really need to spend paragraphs refuting something that is obviously false.
Ha, generally when someone can't disprove something it's because they don't have a valid point. You not being able to disprove my point is very telling :)
area51org 18 hours ago [-]
Having lived in that era: no one "complained endlessly", or even at all, about the internet. It was seen as magical. When compared to not existing at all, being slow wasn't all that awful.
skydhash 16 hours ago [-]
I remember using the internet around 2005, and you could hold a conversation while waiting for a page to load. No one complained, because you had a wealth of information at your fingertips. It was actually amazing to chat with someone anywhere in the world or to be able to browse some forums.
bunderbunder 21 hours ago [-]
This is such selective hindsight, though. We remember the small minority of products that persisted and got better. We don't remember the majority of ones that fizzled out after the novelty wore off, or that ultimately plateaued.
Me, I agree with the author of the article. It's possible that the technology will eventually get there, but it doesn't seem to be there now. And I prefer to make decisions based on present-day reality instead of just assuming that the future I want is the future I'll get.
chollida1 20 hours ago [-]
> This is such selective hindsight, though.
Ha;) Yes, when you provide examples to prove your point they are, by definition, selective:)
You are free to develop your own mental models of what technology and companies to invest in. I was only trying to share my 20 years of experience with investing to show why you shouldn't discard current technology because of its current limits.
bunderbunder 20 hours ago [-]
Fair, but also, investing is kind of its own thing because it's inherently trying to predict the future based on partial information today.
Engineering decisions, which is closer to what TFA is talking about, tend to have to be a lot more focused on the here & now. You can make bets on future R&D developments (e.g, the Apollo program), but that's a game best played when you also have some control over R&D budgeting and direction (e.g, the Apollo program), and when you don't have much other choice (e.g, the Apollo program).
overgard 21 hours ago [-]
I'm not a fan of the argument that LLMs have gotten X times better in the past few years, so thusly they will continue to get X times better in the next few years. From what I can see, all the growth has mostly come from optimizing a few techniques, but I'm not convinced that we aren't going to get stuck in a local maximum (actually, I think that's the most likely outcome).
Specifically, to me the limitation of LLMs is discovering new knowledge and being able to reason about information they haven't seen before. LLMs still fail at things like counting the number of b's in the word blueberry, or get distracted when random cat facts are inserted into word problems (both issues I've seen in the last month).
I don't mean that to say they're a useless tool, I'm just not into the breathless hype.
The latest releases are seeing smaller and smaller improvements, if any. Unless someone can explain the technical reasons why they're likely to scale to being able to do X, it's a pretty useless claim.
masterj 20 hours ago [-]
> LLMs may not be able to build software today, but they are 10x better than where they were in 2022 when we first started using ChatGPT. It's pretty reasonable to assume in 5 years they will be able to do these types of development tasks.
We can expect them to be better in 5 years, but your last assertion doesn't follow. We can't assert with any certainty that they will be able to specifically solve the problems laid out in the article. It might just not be a thing LLMs are good at, and we'll need new breakthroughs that may or may not appear.
mbesto 20 hours ago [-]
We also thought 3D printing would print us a car, but alas.
FWIW - 3D printing has come a long way, and I personally have a 3D printer. But the idea that it was going to completely disrupt manufacturing is simply not true. There are known limitations (how the heck are you going to get a wood polymer squeezed through a metal tip?) and those limitations are physical, not technical.
chollida1 20 hours ago [-]
Agreed on 3D printing but that is a technology that would have failed my screening as proposed.
They haven't continued to see massive adoption and improvement despite the flaws people point out.
They had initial success at printing basic plastic pieces but have failed to print in other materials like metal as you correctly point out, so these wouldn't pass my screening as they currently sit.
fragmede 19 hours ago [-]
The fact that when I need a bag clip I can just search for one in an app on my phone and hit print, mostly trouble-free, says that it's here. Sure, spending $1500 to save $3 isn't economically optimal, but 3D printing has disrupted things. Just look at the SpaceX rocket engines.
mrheosuper 3 hours ago [-]
> Electric cars sucked, no charging, low distance, expensive and yet no matter how much people complain about them they kept getting better.
It took them over a century to get to this point.
jansper39 2 hours ago [-]
They've not been in active development for that time though, only really the last 12 years.
fmbb 20 hours ago [-]
People also complained a lot about VR.
And NFTs had a lot of loud detractors.
And everyone complained about a million other solutions that did not go anywhere.
Still, a bunch of investors made a lot of money on VR and very much so on NFT. Investments being good is not an indicator of anything being useful.
danielbln 20 hours ago [-]
I use LLMs every single day, for hours. I was suuuuuuper into VR in the early-to-mid 2010s, but even that didn't see much adoption among my peers, whereas everyone is using LLMs.
And NFTs were always perceived as a scam, same as the breathless blockchain nonsense.
LLMs have many many issues, but I think they stick out as different to the other examples.
jarjoura 19 hours ago [-]
I see a bit of distinction here, that the foundation models aren't actually 10x better than in 2022. What's improved though is that we have far more domain knowledge of how to get more out of slightly improved models.
So consider your analogy, that the internet was always useful, but it was javascript that caused the actual titanic shift in the software industry. Even though the core internet backbone didn't radically improve as fast as you imagine it would have. Javascript was hacked together as a toy scripting language meant to make pages more interactive, but turns out, it was the key piece in unlocking that 10x value of the already existing internet.
Agents and the explosion of all these little context services are where I see the same thing happening here. Right now they are buggy, and mostly experimental toys. However, they are unlocking that 10x value.
skydhash 16 hours ago [-]
> Javascript was hacked together as a toy scripting language meant to make pages more interactive, but turns out, it was the key piece in unlocking that 10x value of the already existing internet.
Was it? I remember installable software, much more than the web, being the core usage of computers. Even today, most people are using apps.
ausbah 21 hours ago [-]
those are really good points, but LLMs have really started to plateau in their capabilities, haven't they? the improvement from gpt2-class models to 3 was much bigger than 3 to 4, which was only somewhat bigger than 4 to 5
most of the vibe shift I think I’ve seen in the past few months to using LLMs in the context of coding has been improvements in dataset curation and ux, not fundamentally better tech
worldsayshi 21 hours ago [-]
> LLMs have really started to plateau
That doesn't seem unexpected. Any technological leap seems to happen in sigmoid-like steps. When a fruitful approach is discovered we run with it until diminishing returns set in. Often enough a new approach opens doors to other approaches that build on it. It takes time to discover the next step in the chain, but when we do we get a new sigmoid-like leap. Etc...
worldsayshi 21 hours ago [-]
Personally my bet for the next fruitful step is something in line with what Victor Taelin [1] is trying to achieve.
I.e. combining new approaches around old school "AI" with GenAI. That's probably not exactly what he's trying to do but maybe somewhere in the ball park.
Started? In my opinion they haven't gotten better since the release of ChatGPT a few years ago. The weaknesses are still just as bad, the strengths have not improved. Which is why I disagree with the hype saying they'll get better still. They don't do the things they are claimed to today, and haven't gotten better in the last few years. Why would I believe that they'll achieve even higher goals in the future?
Closi 19 hours ago [-]
I assume you don’t use these models frequently, because there is a staggering difference in response quality from frontier LLMs compared to GPT 3.
Go open the OpenAI API playground and give GPT3 and GPT5 the same prompt to make a reasonably basic game in JavaScript to your specification and watch GPT 3 struggle and GPT 5 one-shot it.
globular-toast 6 hours ago [-]
Sure, but it's kinda like a road that never quite gets you anywhere. It seems to get closer and closer to the next town all the time, but ultimately it's still not there yet, and that's all that really matters.
DanielHB 21 hours ago [-]
All the other things he mentioned didn't rely on breakthroughs, LLMs really do seem to have reached a plateau and need a breakthrough to push along to the next step.
Thing is breakthroughs are always X years away (50 for fusion power for example).
The only example he gave that actually was kind of a big deal was mobile phones, where capacitive touchscreens really did catapult the technology forward. But it's not like cell phones weren't already super useful, profitable, and getting better over time before capacitive touchscreens were introduced.
Maybe broadband to the internet also qualifies.
Closi 7 hours ago [-]
> All the other things he mentioned didn't rely on breakthroughs, LLMs really do seem to have reached a plateau and need a breakthrough to push along to the next step.
I think a lot of them relied on gradual improvement and lots of 'mini-breakthroughs' rather than one single breakthrough that changes everything. These mini-breakthroughs took decades to realise themselves properly in almost every example on the list too, not just a couple of years.
My personal gut feel is that even if the core technology plateaus, there's still lots of iterative improvement to go after on the productisation/commercialisation of the existing technology (e.g. improving tooling/UI, applying it to real problems, productising current research, etc.).
In electric car terms - we are still at the stage where Tesla is shoving batteries in a lotus elise, rather than releasing the model 3. We might have the lithium polymer batteries, but there's still lots of work to do to pull it into the final product.
(Having said this - I don't think the technology has plateaued - I think we are just looking at it across a very narrow time span. If someone in 1979 had said that computers had plateaued because there hadn't been much progress in the last 12 months, they would have been very wrong. Breakthroughs sometimes take longer as a technology matures, but that doesn't mean that the technology two decades from now won't be substantially different.)
imtringued 1 hours ago [-]
There also is an absolutely massive gap between Llama 2 and Llama 3. The Llama 3.1 models represent the beginning of usable open weight models. Meanwhile Llama 4 and its competitors seem to be incremental improvements.
Yes, the newest models are so much better that they obsolete the old ones, but now the biggest differences between models is primarily what they know (parameter count and dataset quality) and how much they spend thinking (compute budget).
stpedgwdgfhgdd 20 hours ago [-]
There is a big difference between Claude Code today and 6 months ago. Perhaps the LLMs are plateauing, but the tooling is not.
NitpickLawyer 20 hours ago [-]
> but LLMs have really started to plateau off on their capabilities haven’t they?
Uhhh, no?
In the past month we've had:
- LLMs (3 different models) getting gold at IMO
- gold at IOI
- beat 9/10 human developers at the AtCoder Heuristic Contest (optimisation problems), with the single human who actually beat the machine saying he was exhausted and that next year it'll probably be over.
- agentic coding that actually works, and works for 30-90 minute sessions while staying coherent and actually finishing tasks.
- a 4-6x reduction in price for top-tier (SotA?) models. OpenAI's "best" model now costs $10/MTok, while retaining 90+% of the performance of their previous SotA models that were $40-60/MTok.
- several "harnesses" being released by every model provider. Claude Code seems to remain the best, but alternatives are popping up everywhere - geminicli, opencoder, qwencli (forked, but still), etc.
- open-source models that are getting close to SotA again, being 6-12 months behind (depending on who you ask) and cheap to run (~$2/MTok on some providers).
I don't see the plateauing in capabilities. LLMs are plateauing only in benchmarks, where the number can only go up so far before it stops being meaningful. IMO regular benchmarks have become useless. MMLU & co are cute, but agentic performance is what matters. And those capabilities have only improved, and will continue to improve, with better data, better signals, better training recipes.
Why do you think every model provider is heavily subsidising coding right now? They all want that sweet, sweet data and those signals, so they can improve their models.
cameronh90 21 hours ago [-]
I'm not sure I'd describe it as a plateau. It might be, but I'm not convinced. Improvements are definitely not as immediately obvious now, but how much of that is due to it being more difficult to accurately gauge intelligence above a certain point? Or even that the marginal real life utility of intelligence _itself_ starts to plateau?
A (bad) analogy would be that I can pretty easily tell the difference between a cat and an ape, and the differences in capability are blatantly obvious - but the improvements when going from IQ 70 to Einstein are much harder to assess and arguably not that useful for most tasks.
I tend to find that when I switch to a new model, it doesn't seem any better, but then at some point after using it for a few weeks I'll try to use the older model again and be quite surprised at how much worse it is.
einrealist 20 hours ago [-]
> Twitter sucked [...] Electric cars sucked [...] Phones sucked
All these things are not black boxes, and they are mostly deterministic. Based on the inputs, you know EXACTLY what to expect as output.
That's not the case with LLMs, how they are trained and how they work internally.
We certainly get a better understanding on how to adjust the inputs so we get a desired output. But that's far from guaranteed at the same level as the examples you mentioned.
That's a fundamental problem with LLMs. And you can see that in how industry actors are building solutions around that problem. Reasoning (chain-of-thought) is basically a band-aid to narrow a decision tree, because the LLM does not really "reason" about anything. And the results only get better with more training data. We literally have to brute-force useful results by throwing more compute and memory at the problem (and destroying the environment and climate by doing so).
The stagnation of recent model releases does not look good for this technology.
isoprophlex 20 hours ago [-]
Now think about hoverboards, self-cleaning shirts, moon bases, flying cars, functioning democracies, whatever VR tech is described in snow crash as well. Where on the spectrum will LLMs fall?
4b11b4 20 hours ago [-]
"it's pretty reasonable".. big jump?
tjtryon 19 minutes ago [-]
I find that if you are great at pseudocode, and great at defining a program's logic in that pseudocode, along with defining all possible cases and adding id10t-error handling to the logic, you can get a pretty good framework of a program to start working with. I also ask for comments for all logic, loops, if statements, and functions that are instructional, yet easy enough that a 5th grader could understand them. This framework usually comes in at a version .8 for further manual development, or even at a .9 beta test/debugging level. Occasionally, I've seen version 1.0 release-candidate-level work, where I need to verify the functionality AND try to find ways users might break it. Whichever version I end up with, it still involves manual coding; there is no escaping that. Using an LLM just saves (a lot of) time on the initial framework and program logic.
JimDabell 21 hours ago [-]
LLMs can’t build software because we are expecting them to hear a few sentences, then immediately start coding until there’s a prototype. When they get something wrong, they have a huge amount of spaghetti to wade through. There’s little to no opportunity to iterate at a higher level before writing code.
If we put human engineering teams in the same situation, we’d expect them to do a terrible job, so why do we expect LLMs to do any better?
We can dramatically improve the output of LLM software development by using all those processes and tools that help engineering teams avoid these problems:
yup. I started a fully autonomous, 100% vibe-coded side project called steadytext, mostly expecting it to hit a wall, with LLMs eventually struggling to maintain or fix any non-trivial bug in it. turns out I was wrong: not only has claude opus been able to write a pretty complex 7k LoC project with a python library, a CLI, _and_ a postgres extension, it actively maintains it and is able to fix filed issues and feature requests entirely on its own. It is completely vibe coded, I have never even looked at 90% of the code in that repo. it has full test coverage, passes CI, and we use it in production!
granted- it needs careful planning for CLAUDE.md, and all issues and feature requests need a lot of in-depth specifics, but it all works. so I am not 100% convinced by this piece. I'd say it's def not easy to get coding agents to manage and write software effectively, and especially hard to do so in existing projects, but my experience has been across that entire spectrum. I have been sorely disappointed in coding agents and even abandoned a bunch of projects and dozens of pull requests, but I have also seen them work.
Thanks for sharing this! It's difficult to find good examples of useful codebases where coding agents have done most of the work. I'm always actively looking at how I can push these agents to do more for me and it's very instructive to hear from somebody who has had success on this level. (Would be nice to read a writeup, too)
aethrum 19 hours ago [-]
Huh, interesting. Though I do wonder if the best possible thing an AI could help code would be another AI tool
itsalotoffun 6 minutes ago [-]
This way to the hard take-off.
tossandthrow 6 hours ago [-]
We don't expect humans to do a terrible job - we just expect them to facilitate the process.
If the LLM started sketching up screens and asked questions back about the intention of the software, then I am sure people would have a much better experience.
jarjoura 18 hours ago [-]
Okay, I'm willing to entertain your cynical take. However, experience has shown me that if we need to solve a vague problem as a team of engineers and designers, we know to get ample context of what it is we're actually trying to build.
Plus, the most creative solutions often come from implicit and explicit constraints. This is entirely a human skill and something we excel at.
These LLMs aren't going to "consider" something, understand the constraints, and then fit a solution inside those constraints that weren't explicitly defined for it somehow. If constraints aren't well understood, either through common problems, or through context documents, it will just go off the deep end trying to hack something together.
So right now we still need to rely on humans to do the work of breaking problems down, scoping the work inside of those constraints, and then coming up with a viable path forward. Then, at that point, the LLM becomes just another way to execute on that path forward. Do I use javascript, rust, or Swift to write the solution, or do I use `CLAUDE.md` with these 30 MCP services to write the solution.
For now, it's just another tool in the toolbox at getting to the final solution. I think the conversations around it needing to be a binary either, all or nothing, is silly.
bagacrap 18 hours ago [-]
There are a lot of human engineers who do a fine job in these situations, akshwally.
If it isn't easy to give commands to LLMs, then what is the purpose of them?
imtringued 1 hours ago [-]
>If we put human engineering teams in the same situation, we’d expect them to do a terrible job, so why do we expect LLMs to do any better?
Because LLMs were trained for one shot performance and they happen to beat humans at that.
otterley 20 hours ago [-]
This is the approach that Kiro is taking, although it’s early days. It’s not perfect but it does produce pretty good results if you adhere to its intent.
quantumHazer 20 hours ago [-]
one minute of research on the internet led me to discover that you are a MARKETING MANAGER at amazon. so your take is full of conflict of interest and this should be disclosed.
otterley 18 hours ago [-]
Fair enough and I apologize for not disclosing it. However, Kiro is not a service in scope for me, and this is my own opinion, not that of the company.
(Also, there is no conflict of interest here, and you do not need to yell. I’m free to criticize my company if I like.)
myflash13 45 minutes ago [-]
This was Peter Naur’s observation in his 1985 paper Programming as Theory Building. One of the best papers I’ve ever read. Programming IS mostly building castles in the air of your head, not the code on the screen.
If you do the thinking and let the LLM do the typing it works incredibly well. I can write code 10x faster with AI, but I’m maintaining the mental model in my head, the “theory” as Naur calls it. But if you try to outsource the theory to the LLM (build me an app that does X) you’re bound to fail in horrible ways. That’s why Claude Code is amazing but Replit can only do basic toy apps.
9cb14c1ec0 21 hours ago [-]
> what they cannot do is maintain clear mental models
The more I use claude code, the more frustrated I get with this aspect. I'm not sure that a generic text-based LLM can properly solve this.
dlivingston 21 hours ago [-]
Reminds me of how Google's Genie 3 can only run for a ~minute before losing its internal state [0].
My gut feeling is that this problem won't be solved until some new architecture is invented, on the scale of the transformer, which allows for short-term context, long-term context, and self-modulation of model weights (to mimic "learning"). (Disclaimer: hobbyist with no formal training in machine learning.)
It’s the nature of formal systems. Someone needs to actually do the work of defining those rules, or of finding a smaller set of rules that can generate the larger set. But anytime you invent a rule, that means a few things that are possible can’t be represented in the system. You’re mostly hoping that those things aren’t meaningful.
LLMs techniques allows us to extract rules from text and other data. But those data are not representative of a coherent system. The result itself is incoherent and lacks anything that wasn’t part of the data. And that’s normal.
It’s the same as having a mathematical function. Every point that it maps to is meaningful, everything else may as well not exists.
elephanlemon 21 hours ago [-]
I’ve been thinking about this recently… maybe a more workable solution at the moment is to run a hierarchy of agents, with the top level one maintaining the general mental model (and not filling its context with anything much more than “next agent down said this task was complete”). Definitely seems like anytime you try to have one Code agent run everything it just goes off the rails sooner or later, ignoring important details from your original instructions, failing to make sure it’s adhering to CLAUDE.md, etc. I think you can do this now with Code’s agent feature? Anyone have strategies to share?
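For illustration, a hypothetical sketch of that hierarchy in Python (none of these names come from Claude Code's actual agent API; `run_subagent` is a stand-in for whatever agent runner you use): the coordinator keeps only one-line outcome summaries, while each sub-agent gets a fresh, fully detailed prompt.

```python
from dataclasses import dataclass, field

@dataclass
class SubagentResult:
    status: str  # e.g. "complete", "blocked: needs schema migration"

def run_subagent(task: str, instructions: str) -> SubagentResult:
    # Placeholder: delegate the full, detailed task to a fresh agent session here.
    return SubagentResult(status="complete")

@dataclass
class Coordinator:
    plan: list[str]                                      # the high-level "mental model" lives here
    summaries: list[str] = field(default_factory=list)

    def run(self) -> list[str]:
        for task in self.plan:
            result = run_subagent(task, instructions="Follow CLAUDE.md strictly.")
            # Only a terse outcome line enters the coordinator's context,
            # never the sub-agent's whole transcript.
            self.summaries.append(f"{task}: {result.status}")
        return self.summaries
```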
skydhash 21 hours ago [-]
The telephone game doesn’t work that well. That’s how an emperor ends up isolated in his palace and every edict becomes harmful. It’s why the architect/developer split didn’t work. You need to be aware of all the context required to make sure you’ve done a good job.
That and other tricks have only made me slightly less frustrated, though.
cmrdporcupine 21 hours ago [-]
Honestly it forces you -- rightfully -- to step back and be the one doing the planning.
You can let it do the grunt coding, and a lot of the low level analysis and testing, but you absolutely need to be the one in charge on the design.
It frankly gives me more time to think about the bigger picture within the amount of time I have to work on a task, and I like that side of things.
There's definitely room for a massive amount of improvement in how the tool presents changes and suggestions to the user. It needs to be far more interactive.
mock-possum 21 hours ago [-]
That’s my experience as well - I’m the one with the mental model, my responsibility is using text to communicate that model to the LLM using language it will recognize from its training data to generate the code to follow suit.
My experience with prompting LLMs for codegen is really not much different from my experience with querying search engines - you have to understand how to ‘speak the language’ of the corpus being searched, in order to find the results you’re looking for.
micromacrofoot 21 hours ago [-]
Yes this is exactly it, you need to talk to Claude about code on a design/architecture level... just telling it what you want the code to output will get you stuck in failure loops.
I keep saying it and no one really listens: AI really is advanced autocomplete. It's not reasoning or thinking. You will use the tool better if you understand what it can't do. It can write individual functions pretty well, stringing a bunch of them together? not so much.
It's a good tool when you use it within its limitations.
SoftTalker 21 hours ago [-]
Is this really that different from the "average" programmer, especially a more junior one?
> LLMs get endlessly confused: they assume the code they wrote actually works; when test fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over.
I see this constantly with mediocre developers. Flailing, trying different things, copy-pasting from StackOverflow without understanding, ultimately deciding the compiler must have a bug, or cosmic rays are flipping bits.
layer8 20 hours ago [-]
The article explicitly calls out that that’s what they are looking for in a competent software engineer. That incompetent developers exist, and that junior developers tend to not be very competent yet, doesn’t change anything about that. The problem with LLMs is that they’re already the final product of training/learning, not the starting point. The (in)ability of an LLM to form stable mental models is fixed in its architecture, and isn’t anything you can teach them.
SoftTalker 20 hours ago [-]
I just (re) read the article and the word "competent" doesn't appear in it. It doesn't discuss human developer competency at all, except in comparison to LLMs.
layer8 20 hours ago [-]
Yes, I replaced “effective” by “competent” in my response, because I found that word slightly preferable in the context discussed.
Xss3 20 hours ago [-]
I feel like something is wrong where you are, maybe your juniors do not feel incentivized or encouraged to learn, code reviews might not be strict enough, quality may not be valued enough, and immense pressure to move tickets might be put on people, or all of the above in various doses.
I feel this way because at my company, our interns on a gap year from their comp sci degree don't blame the compiler or cosmic rays, or blindly copy from Stack Overflow.
They're incentivized and encouraged to learn and absolutely choose to do so. The same goes for seniors.
If you say 'I've been learning about X for ticket Y' in the standup people basically applaud it, managers like us training ourselves to be better.
Sure managers may want to see a brief summary or a write-up applicable to our department if you aren't putting code down for a few days, but that's the only friction.
hahn-kev 20 hours ago [-]
I find it impressive that LLMs can so closely mimic the behaviour of a junior dev. Even if that's not a desirable outcome it's still impressive and interesting.
nicbou 5 hours ago [-]
I have never tried seriously coding with AI. I just ask ChatGPT for snippets that I can verify, to save a few round trips to Google and API docs.
However the other day I gave ChatGPT a relatively simple assignment, and it kept ignoring the rules. Every time I corrected it, it broke a different rule. I was asking it for gender-neutral names, but it kept giving last names like Orlov (which becomes Orlova), or first names that are purely masculine.
Is it the same with vibe coding?
parimal7 5 hours ago [-]
Has been my experience as well; I never used ChatGPT for anything but as an interface to external documentation and for understanding unfamiliar syntax.
Tried using it for the first time for vibe coding and was quite disappointed with the overall result, felt like a college student hastily copy pasting code from different sources for a project due tomorrow.
Maybe I just gave bad prompts…
energy123 4 hours ago [-]
It's the same. Every assumption, taste, doc, opinion, edge case, unstated knowledge, requirement, exception to the requirement, intention, should be provided for best results. It's not crucial for throwaway weekend projects, but for harder things, it is.
I find it to be the most challenging part. There's a large amount of unstated assumptions that you take for granted, and if you don't provide them all upfront, you'll need to regenerate the code, again and again. I now invest a lot of time into writing all this down before I generate any code.
arendtio 1 hours ago [-]
This is the post that the people at Anthropic and Cursor should read.
> But what they cannot do is maintain clear mental models.
The emphasis should be on maintain. At some point, the AI tends to develop a mental model, but over time, it changes in unexpected ways or becomes absent altogether. In addition, the quality of the mental models is often not that good to begin with.
kleyd 3 hours ago [-]
Current LLMs look a lot like a very advanced 'old brain' to me. While context engineering looks like optimizing the working memory.
What's missing is a part with more plasticity that can work in parallel and bi-directionally interact with the current static models in real-time.
This would mean individually trained models based on their experience so that knowledge is not translated to context, but to weight adjustments.
movpasd 2 hours ago [-]
That's also my view. It's clear that these models are more than pure language algorithms. Somewhere within the hidden layers are real, effective working models of how the world works. But the power of real humans is the ability to learn on-the-fly.
Disclaimer: These are my not-terribly-informed layperson's thoughts :^)
The attention mechanism does seem to give us a certain adaptability (especially in the context of research showing chain-of-thought "hidden reasoning") but I'm not sure that it's enough.
Thing is, earlier language models used recurrent units that would be able to store intermediate data, which would give more of a foothold for these kinds of on-the-fly adjustments. And here is where the theory hits the brick wall of engineering. Transformers are not just a pure machine learning innovation; the key is that they are massively scalable, and my understanding is that part of this comes from the _lack_ of recurrence.
I guess this is where the interest in foundation models comes from. If you could take a codebase as a whole and turn it into effective training data to adjust the weights of an existing, more broadly-trained model, that would give you the missing plasticity. But is this possible with a single codebase's worth of data?
Here again we see the power of human intelligence at work: the ability to quite consciously develop new mental models even given very little data. I imagine this is made possible by leaning on very general internal world-models that let us predict the outcomes of even quite complex unseen ("out-of-distribution") situations, and that gives us extra data. It's what we experience as the frustrations and difficulties of the learning process.
lordnacho 21 hours ago [-]
I think I agree with the idea that LLMs are good at the junior level stuff.
What's happened for me recently is I've started to revisit the idea that typing speed doesn't matter.
This is an age-old thing: most people don't think it really matters how fast you can type. I suppose the steelman is that most people think it doesn't really matter how fast you can get the edits to your code that you want. With modern tools, you're not typing out all the code anyway, and there are all sorts of non-AI ways to get your code looking the way you want. And that doesn't matter, because the real work of the engineer is the architecture of how the whole program functions. Typing things faster doesn't get you to the goal faster, since finding the overall design is the limiting thing.
But I've been using Claude for a while now, and I'm starting to see the real benefit: you no longer need to concentrate to rework the code.
It used to be burdensome to do certain things. For instance, I decided to add an enum value, and now I have to address all the places where it matches on that enum. This wasn't intellectually hard in the old world, you just got the compiler to tell you where the problems were, and you added a little section for your new value to do whatever it needed, in all the places it appeared.
But you had to do this carefully, otherwise you would just cause more compile/error cycles. Little things like forgetting a semicolon will eat a cycle, and old tools would just tell you the error was there, not fix it for you.
LLMs fix it for you. Now you can just tell Claude to change all the code in a loop until it compiles. You can have multiple agents working on your code, fixing little things in many places, while you sit on HN and muse about it. Or perhaps spend the time considering what direction the code needs to go.
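To make that chore concrete (the comment doesn't name a language, so this is a hedged Python sketch with made-up names like `PaymentState` and `describe`): add a member to an enum, and let the type checker flag every match site that still needs its little section for the new value. That flag-and-fix loop is exactly what's being delegated.

```python
# Requires Python 3.11+ for typing.assert_never; run mypy or pyright to get the errors.
from enum import Enum
from typing import assert_never

class PaymentState(Enum):
    PENDING = "pending"
    SETTLED = "settled"
    REFUNDED = "refunded"   # the newly added value

def describe(state: PaymentState) -> str:
    if state is PaymentState.PENDING:
        return "awaiting settlement"
    if state is PaymentState.SETTLED:
        return "done"
    # The type checker reports an error here until a REFUNDED branch is added above --
    # the same kind of mechanical fix-up being handed off to the agent.
    assert_never(state)
```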
The big thing however is that when you're no longer held up by little compile errors, you can do more things. I had a whole laundry list of things I wanted to change about my codebase, and Claude did them all. Nothing on the business level of "what does this system do" but plenty of little tasks that previously would take a junior guy all day to do. With the ability to change large amounts of code quickly, I'm able to develop the architecture a lot faster.
It's also a motivation thing: I feel bogged down when I'm just fixing compile errors, so I prioritize what to spend my time on if I am doing traditional programming. Now I can just do the whole laundry list, because I'm not the guy doing it.
AstroBen 17 hours ago [-]
> Typing things faster doesn't make you get to the goal faster, since finding the overall design is the limiting thing.
interesting point and that matches my experience quite well. LLMs have been horrendous at creating a good design. Even on a micro scale I almost always have them refactor the functions they write
I certainly get a productivity boost at actually doing the implementation.. but the implementation is already there in my head or on paper. It's really hard to know the true improvement
I do find them useful for brainstorming. I can throw a bunch of code and tests at it and ask what edge cases I might want to consider, or anything I've missed. 9/10 of their suggestions I just skip over but often there's a few I integrate
Getting something that works vs creating something that'll do well in the medium-long term is just such a different thing that I'm not sure if they'll be able to improve at the second
ethan_smith 8 hours ago [-]
The real productivity gain isn't just typing speed but cognitive offloading - though we must be careful this doesn't atrophy our ability to maintain accurate mental models since delegating implementation details can disconnect us from important system nuances.
ambicapter 21 hours ago [-]
> I had a whole laundry list of things I wanted to change about my codebase
I always have a whole bunch of things I want to change in the codebase I'm working on, and the bottleneck is review, not me changing that code.
lordnacho 21 hours ago [-]
Those are the same thing though? You change the code, but can't just edit it without testing it.
LLM also helps you test.
marcosdumay 20 hours ago [-]
Review is not testing. Testing does almost nothing to make your program correct, and does nothing at all to make your code "good".
Almost every quality software has is designed in from a higher abstraction level. Almost nothing is put there by fixing error after error.
tempodox 18 hours ago [-]
> plenty of little tasks that previously would take a junior guy all day to do.
But that's also where said junior learns something. If those juniors get replaced by machines and not even get hired any more, who is going to teach them?
AstroBen 17 hours ago [-]
How often do companies invest in people to train them? Maybe the smart ones are going to need to start doing that or they'll crash and burn
crabmusket 13 hours ago [-]
The 4 step process outlined at the start of this article really reminds me of Deutsch's The Beginning of Infinity:
> The real source of our theories is conjecture, and the real source of our knowledge is conjecture alternating with criticism.
(This is rephrased Karl Popper, and Popper cites an intellectual lineage beginning somewhere around Parmenides.)
I see writing tests as a criticism of the code you wrote, which itself was a conjecture. Both are attempting to approach an explanation in your mind, some platonic idea that you think you are putting on paper. The code is an attempt to do so, the test is criticism from a different direction that you have done so.
sovietswag 8 hours ago [-]
That quote about conjecture reminds me of a big point from Zen and the Art of Motorcycle Maintenance. The author suggests that 'science' / 'the scientific method' don't actually account for the process by which ideas/hypotheses come into existence, science only comes into play once the hypothesis appears (from whence does it appear?). He calls that magic smoke 'Quality'. (Using the language you cited, I guess we would be asking about where the conjecture itself comes from). I'm realizing now that this is tangential to your point, sorry, but thanks for posting this interesting comment.
emilecantin 21 hours ago [-]
Yeah, I think it's pretty clear to a lot of people that LLMs aren't at the "build me Facebook, but for dogs" stage yet. I've had relatively good success with more targeted tasks, like "Add a modal that does this, take this existing modal as an example for code style". I also break my problem down into smaller chunks, and give them one by one to the LLM. It seems to work much better that way.
bagacrap 18 hours ago [-]
I can already copy paste existing code and tweak it to do what I want (if you even consider that "software engineering"). The difference being that my system clipboard is deterministic, rather than infinitely creative at inventing new ways to screw up.
hahn-kev 20 hours ago [-]
I do wonder how something like v0 would handle that request though.
1zael 21 hours ago [-]
> "when test fail, they are left guessing as to whether to fix the code or the tests"
One thing I've found that helps is using the "Red-Green-Refactor" language. We're in the RED phase - the test should fail. We're in the GREEN phase - make this test pass with minimal code. We're in the REFACTOR phase - improve the code without breaking tests.
This helps the LLM understand the TDD mental model rather than just seeing "broken code" that needs fixing.
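For illustration, a minimal sketch of what those phase prompts map to in code (Python/pytest; the function name `parse_price` is made up, not from the article):

```python
# RED phase: write the failing test first and tell the agent it is supposed to fail.
def test_parse_price_strips_currency_symbol():
    assert parse_price("$19.99") == 19.99

# GREEN phase: ask only for the minimal code that makes the test pass.
def parse_price(text: str) -> float:
    return float(text.lstrip("$"))

# REFACTOR phase: improve naming and error handling while the test stays green,
# without adding behavior the tests don't demand.
```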
cortesoft 10 hours ago [-]
Sure, but isn't the issue when you are trying to move from the RED phase to the GREEN phase... are you still getting red because the test was bad or because the code isn't working yet?
pjmlp 21 hours ago [-]
Only because most AI startups are doing it wrong.
I don't want a chat window.
I want AI workflows as part of my IDE, like Visual Studio, InteliJ, Android Studio are finally going after.
I want voice-controlled actions in my native language.
Knowledge across everything in the project for doing code refactorings, static analysis with an AI feedback loop, generating UI from handwritten sketches, programming on the go using handwriting, source control commit messages generated from code changes, ...
lisbbb 7 hours ago [-]
Back around, I don't even know, 2013? A colleague and I were working on updating a system that scanned in letters with mail order forms. The workers would lay the items from the envelopes in order on a conveyor type scanner. They had to lay them down in order: order form, payment check, envelope. The system would scan each document and add two blank fake scanned pages after each envelope. The company that set it up billed by scanned page. We figured out that you didn't need the blank pages as a delimiter because the envelope could reliably serve as that. By the way, the OCR was so bad that they never got the order forms to scan automatically, but people had to examine the order form as a pdf doc and key in everything instead. By eliminating the fake, nonsensical blank scanned pages, we saved the company over $1M/year in costs. We never got a single accolade or pat on the back or anything for that. Can AI do that, though?
generalizations 21 hours ago [-]
These LLM discussions really need everyone to mention what LLM they're actually using.
> AI is awesome for coding! [Opus 4]
> No AI sucks for coding and it messed everything up! [4o]
Would really clear the air. People seem to be evaluating the dumbest models (apparently because they don't know any better?) and then deciding the whole AI thing just doesn't work.
stackbutterflow 19 hours ago [-]
Don't expect any improvement ever.
It happens on many topics related to software engineering.
The web developer is replying to the embedded developer who is replying to the architect-that-doesnt-code who is replying to someone with 2 years of experience who is replying to someone working at google who is replying to someone working at a midsize b2b German company with 4 customers. And on and on.
Context is always omitted and we're all talking about different things ignoring the day to day reality of our interlocutors.
bagacrap 17 hours ago [-]
My experience is that AI enthusiasts will always say, "well you just used the wrong model". And when no existing model works well, they say, "well in 6 months it will work". The utility of agentic coding for complex projects is apparently unfalsifiable.
taormina 21 hours ago [-]
I've used a wide variety of the "best" models, and I've mostly settled on Opus 4 and Sonnet 4 with Claude Code, but they don't ever actually get better. Grok 3-4 and GPT4 were worse, but like, at a certain point you don't get brownie points for not tripping over how low the bar is set.
generalizations 20 hours ago [-]
People have actually been basing their assertions on 4o. The bar is really low and people are still completely missing it.
omnicognate 21 hours ago [-]
What the article says is as true of Opus 4 as any other LLM.
energy123 8 hours ago [-]
> AI is exceptional for coding! [high-compute scaffold around multiple instances / undisclosed IOI model / AlphaEvolve]
> AI is awesome for coding! [Gpt-5 Pro]
> AI is somewhat awesome for coding! ["gpt-5" with verbosity "high" and effort "high"]
> AI is a pretty good at coding! [ChatGPT 5 Thinking through a Pro subscription with Juice of 128]
> AI is mediocre at coding! [ChatGPT 5 Thinking through a Plus subscription with a Juice of 64]
> AI sucks at coding! [ChatGPT 5 auto routing]
troupo 21 hours ago [-]
> These LLM discussions really need everyone to mention what LLM they're actually using.
Do we know which codebases (greenfield, mature, proprietary etc.) people work on? No
Do we know the level of expertise the people have? No.
Is the expertise in the same domain, codebase, language that they apply LLMs to? We don't know.
How much additional work did they have reviewing, fixing, deploying, finishing etc.? We don't know.
--- end quote ---
And that's just the tip of the iceberg. And that is an iceberg before we hit another one: that we're trying to blindly reverse engineer a non-deterministic blackbox inside a provider's blackbox
Transfinity 21 hours ago [-]
> LLMs get endlessly confused: they assume the code they wrote actually works; when test fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over.
I feel personally described by this statement. At least on a bad day, or if I'm phoning it in. Not sure if that says anything about AI - maybe just that the whole "mental models" part is quite hard.
apples_oranges 21 hours ago [-]
It means something is not understood. Could be the product, the code in question, or computers in general. 90% of coders seem to be lacking foundational knowledge imho. Not trying to hate on anyone, but when you have the basics down, you can usually see quickly where the problem is, or at least must be.
aniviacat 21 hours ago [-]
Unfortunately, "usually" is a key word here.
bagacrap 17 hours ago [-]
So LLMs are always phoning it in, on a bad day, etc. Great.
I recently tried to get AI to refactor some tests, which it proceeded to break. Then it iterated a bit till it had gotten the pass rate back up to 75%. At this point it declared victory. So yes, it does really seem like a human who really doesn't want to be there.
mikewarot 16 hours ago [-]
3 months ago, I would have agreed with much of this article, however...
In the past week, I watched this video[1] from Welch Labs about how deep networks work, and it inspired an idea. I spent some time "vibe coding" with Visual Studio Code's ChatGPT5 preview and had it generate a python framework that can take an image, and teach a small network how to generate that one sample image.
The network was simple... 2 inputs (x,y), 3 outputs (r,g,b), and a number of hidden layers with a specified number of nodes per layer.
It's an agent, it writes code, tests it, fixes problems, and it pretty much just works. As I explored the space of image generation, I had it add options over time, and it all just worked. Unlike previous efforts, I didn't have to copy/paste error messages in and try to figure out how things broke. I was pleasantly surprised that the code just worked in a manner close to what I wanted.
The only real problem I had was getting .venv working right, and that's more of an install issue than the LLM's fault.
I've got to say, I'm quite impressed with Python's argparse library.
It's amazing how much detail you can get out of 4 hidden layers of 64 values and 3 output channels (RGB), if you're willing to throw a few days of CPU time at it. My goal is to see just how small a network I can make that still generates my favorite photo.
As it iterates through checkpoints, I have it output an image with the current values to compare against the original. It's quite fascinating to watch as it folds the latent space to capture the major features of the photo, then folds some more to catch smaller details, over and over, as the signal-to-noise ratio very slowly increases over the hours.
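For readers who want to picture the setup: a minimal sketch of this kind of per-image coordinate network (my own PyTorch/Pillow reconstruction from the description, not the commenter's code; `photo.png`, the step counts, and the learning rate are placeholders, while the layer sizes follow the numbers above):

```python
import numpy as np
import torch
import torch.nn as nn
from PIL import Image

# Load the target photo; "photo.png" is a placeholder filename.
img = np.asarray(Image.open("photo.png").convert("RGB"), dtype=np.float32) / 255.0
h, w, _ = img.shape

# One training sample per pixel: input is the normalized (x, y) coordinate,
# target is that pixel's (r, g, b) color.
ys, xs = np.mgrid[0:h, 0:w]
coords = torch.tensor(np.stack([xs / w, ys / h], axis=-1).reshape(-1, 2), dtype=torch.float32)
targets = torch.tensor(img.reshape(-1, 3), dtype=torch.float32)

# 2 inputs -> 4 hidden layers of 64 -> 3 outputs, matching the sizes in the comment.
layers, in_dim = [], 2
for _ in range(4):
    layers += [nn.Linear(in_dim, 64), nn.ReLU()]
    in_dim = 64
layers += [nn.Linear(in_dim, 3), nn.Sigmoid()]
model = nn.Sequential(*layers)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(5000):
    pred = model(coords)
    loss = nn.functional.mse_loss(pred, targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        # Periodically write out the network's current "memory" of the image
        # so you can watch detail emerge as training goes on.
        out = (pred.detach().numpy().reshape(h, w, 3) * 255).astype(np.uint8)
        Image.fromarray(out).save(f"checkpoint_{step}.png")
```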
As for ChatGPT5, maybe I just haven't run out of context window yet, but for now, it all just seems like magic.
Side comment, I love the typography of the site. Easy to read.
kachapopopow 5 hours ago [-]
I'm tired of people telling me that LLMs are bad at building software without sitting down to learn how to properly use Claude Code, when to use it, and when you shouldn't.
Cursor is a joke tho, windsurf is pretty okay.
ale 5 hours ago [-]
This is not a you're-holding-the-llm-wrong problem though, AI tools are simply not capable of creating mental models for problem solving. Sounds like you're tired of hearing that LLMs are not a silver bullet.
kachapopopow 4 hours ago [-]
Exactly, there are things you shouldn't do with an LLM. But generating Helm charts, configs, and Actions workflows, or building specs and then implementing based on them? Simply a no-brainer.
TrackerFF 5 hours ago [-]
LLMs are as good as the inputs a person gives them.
Right now the scene is very polarized. You have the "AI is a failure, you can't build anything serious, this bubble is going to pop any day now" camp, and the "AI has revolutionized my workflow, I am now 10x more productive" camp.
I mean these types of posts blow up here every. single. day.
kachapopopow 4 hours ago [-]
It's neither useless nor it's 10x. It's a solid 1.2x - 1.8x.
lysecret 19 hours ago [-]
Bit of a clickbaity title, since they can definitely help in building software.
However, I agree with the main thesis (that they can’t do it on their own). Also, related to this, the whole idea of “we will easily fix memory next” will turn out the same as “we can fix vision in one summer” did: 30 years later it’s much improved but still not fixed. Memory is hard.
nowittyusername 20 hours ago [-]
Saying LLMs are not good at x or y is akin to saying a brain is useless without a body. Which is obvious. The success of agentic coding solutions depends not just on the model but also on the system that the developers build around it. The companies that will succeed in this area are the ones that focus on building sophisticated and capable systems that utilize said models. We are still in very early days, where most organizations are only coming to terms with this realization... Only a few of them take this concept to the fullest, Claude Code being the best example. The Claude models are specifically trained for tool calling and other capabilities, and the Claude Code CLI complements and takes advantage of those capabilities to the fullest; things like context management, among others, are extremely important...
thrance 2 hours ago [-]
I had an awful, terrible experience with GPT5 a few days ago, that made me remember why I don't use LLMs, and renewed my promise to not use them for at least a year more.
I am a relative newbie to GPU development, and was writing a simple 2D renderer with WebGPU and its rust implementation, wgpu. The goal is to draw a few textures to a buffer, and then draw that buffer to the screen with a CRT effect applied.
I got 99% of the way there on my own, reading the guide, but then got stumped on a runtime error message. Something like "Texture was destroyed while its semaphore wasn't released". Looking around my code, I see no textures ever being released. I decide to give the LLM a go and ask it to help me, and it very enthusiastically gives me a few things to try.
I try them, nothing works. It corrects itself with more things to try, more modifications to my code. Each time giving a plausible explanation as to what went wrong. Each time extra confident that it got the issue pinned down this time. After maybe two very frustrating hours, I tell it to go fuck itself, close the tab and switch my brain on again.
10 minutes later, I notice my buffer's format doesn't match the one used in the render pass that draws to it. Correct that, compile, and it works.
I genuinely don't understand what those pro-LLM-coding guys are doing that they find AIs helpful. I can manage the easy parts of my job on my own, and it fails miserably on the hard parts. Are those people only writing boilerplate all day long?
_pdp_ 5 hours ago [-]
LLMs cannot build software on their own yet. But they sure can build software with some help.
madrasman 7 hours ago [-]
The 2 iOS apps that I published (mid level complexity and work well) say otherwise. I was blown away by what cursor + o3 could do.
tw1984 43 minutes ago [-]
the mentioned "two similar mental models" is an interesting way of looking at the problem. if that is actually the case, it seems to me that a much better model plus a smart enough agent should be able to largely solve the problem.
interesting time, interesting issue.
alliancedamages 20 hours ago [-]
> ...but the distinguishing factor of effective engineers is their ability to build and maintain clear mental models.
I wonder is this not just a proxy for intelligence?
guluarte 20 hours ago [-]
Turns out, English is pretty bad for creating deterministic software. If you are vibe coding, you either are happy with the randomness generated by the LLMs or you enter a loop to try to generate a deterministic output, in which case using a programming language could have been easier.
tempodox 18 hours ago [-]
That's what I don't understand about AI coding fans. Instead of using a language that was designed to produce executable code, they insert another translation stage with a much murkier and fuzzier language. So you have to learn a completely new interface that is less fit for the task for the benefit of uncertain outcomes. And woe betide you if you step outside the most mainstream of mainstreams, where there's not an overabundance of training data.
bagacrap 17 hours ago [-]
It's because the AI coding fans already don't know programming languages, so they're learning a new language/interface either way.
That, and their software doesn't actually have any users, I find.
nextworddev 20 hours ago [-]
60% of the complaints in this post can be solved by providing better requirements and context upfront
windward 3 hours ago [-]
You can't just pile on more context; it has to be a brief distillation of the right context. Otherwise, as it gets longer and longer, each point becomes less likely to be considered. Or, worse, it gets considered in the negative. It feels at times like training a dog. You can't say "don't sit"!
It's understandably frustrating that the promised future ended up being humans having to work how machines want.
bubblyworld 6 hours ago [-]
I think the argument here is nonsense. LLMs clearly work differently to human cognition, so pointing out a difference between how LLMs and humans approach a problem and calling that the reason that they can't build software makes no sense. Plausibly there are many ways to build software that don't make sense to a human.
That said, I agree with the conclusion. They do seem to be missing coherent models of what they work on - perhaps part of the reason they do so poorly on benchmarks like ARC, which are designed to elicit that kind of skill?
anotheryou 20 hours ago [-]
Maybe we should let it build a mental model in documentation markdown files?
Vibing I often let it explain the implemented business logic (instead of reading the code directly) and judge that.
Onewildgamer 21 hours ago [-]
I wonder if some of this can be solved by removing some wrongly set-up context in the LLM. Or by getting a short summary, restructuring it, and feeding it again to a fresh LLM context.
layer8 20 hours ago [-]
I suspect that context can’t fully replace a mental model, because context is in-band, in the same band, as all input the LLM receives. It’s all just a linear token sequence that is taken in uniformly. There’s too little structure, and everything is equally subject to being discarded or distorted within the model. Even if parts of that token sequence remains unchanged (a “stable” context) when iterating over input, the input it is surrounded with can have arbitrary downstream effects within the model, making it more unreliable and unstable than mental models are.
Onewildgamer 5 hours ago [-]
Okay, I see now. I'm just shooting in the dark here: if there's an ability to generate the next best token based on the trained set of words, can it be taken a level up, to a meta level, to generate whole generations like genetic programming does? Or is that what the chain-of-thought reasoning models do?
Maybe I need to do more homework on LLMs in general.
siva7 5 hours ago [-]
> But, we firmly believe that (at least for now) you are in the drivers seat, and the LLM is just another tool to reach for.
So do Microsoft and GitHub. At least that's what they were telling us the whole time. Oh wait... they changed their minds, I think, a week ago.
kypro 4 hours ago [-]
> LLMs get endlessly confused: they assume the code they wrote actually works; when test fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over.
That's actually an interesting point, and something I've noticed a lot myself. I find LLMs are very good at hacking around test failures, but unless the test is failing for a trivial reason often it's pointing at some more fundamental issue with the underlying logic of the application which LLMs don't seem to be able to pick up on, likely because they don't have a comprehensive mental model of how the system should work.
I don't want to point fingers, but I've been seeing this quite a bit in the code of colleagues who heavily use LLMs. On the surface the code looks fine, and they've produced tests which pass, but when you think about it for more than a minute you realise it doesn't really capture the nuance of the requirements, in a way that a human who had a mental model of the system probably wouldn't have missed...
Sometimes humans miss things in the logic when they're writing code, but these look more like mistakes in a line rather than a fundamental failure to comprehend and model the problem. And I know this isn't the case, because when you talk to these developers they get the problem perfectly well.
To know whether the code or the test needs fixing, you need a very clear idea of what should be happening, and LLMs just don't have one. I don't know why that is. Maybe they're missing the context from hours of reading tickets and technical discussions, or maybe it's their failure to ask questions when they're unsure of what should be happening. I don't know if this is a fundamental limitation of LLMs (I'd suspect not, personally), but it is a problem when using LLMs to code today, and one that more compute alone probably can't fix.
saghm 21 hours ago [-]
> Context omission: Models are bad at finding omitted context.
> Recency bias: They suffer a strong recency bias in the context window.
> Hallucination: They commonly hallucinate details that should not be there.
To be fair, those are all issues that most human engineers I've worked with (including myself!) have struggled with to various degrees, even if we don't refer to them the same way. I don't know about the rest of you, but I've certainly had times where I found out that an important nuance of a design was overlooked until well into the process of developing something, forgot a crucial detail that I learned months ago that would have helped me debug something much faster than if I had remembered it from the start, or accidentally made an assumption about how something worked (or misremembered it) and ended up with buggy code as a result. I've mostly gotten pretty positive feedback about my work over the course of my career, so if I "can't build software", I have to worry about the companies that have been employing me and my coworkers who have praised my work output over the years. Then again, I think "humans can't build software reliably" is probably a mostly correct statement, so maybe the lesson here is that software is hard in general.
skydhash 21 hours ago [-]
That’s a communication issue. You should learn how to ask the right questions and document the answers given. What I’ve seen is developers assuming stuff when they should just reach out to team members. Or trying stuff instead of reading documentation. Or trying to remember info instead of noting it down somewhere.
saghm 20 hours ago [-]
Well, yeah, obviously if you're perfectly diligent and never screw up, it's possible to be correct 100% of the time. In my experience, even extremely smart diligent people who are good at asking the right questions and reading documentation still mess up sometimes, which is the point I'm trying to make. If you genuinely don't ever encounter this issue, I guess everyone I've ever worked with and I just aren't as perfect as you and the people you've worked with, but I'd argue that you're not having the average experience of working with regular people if that's the case. Most of us are mere mortals who are sometimes fallible, and while the exact underlying mechanism of how we make mistakes might not be literally identical to the issues described in the article, my point is that the difference might just be a matter of degree rather than something fundamentally different in what types of errors occur.
VladTepes2025 18 hours ago [-]
Have faith in AI, one day it will do what we hallucinate it can do!
empath75 21 hours ago [-]
It's good at micro, but not macro. I think that will eventually change with smarter engineering around it, larger context windows, etc. Never underestimate how much code engineers will write to avoid writing code.
pmdr 21 hours ago [-]
> It's good at micro, but not macro.
That's what I've found as well. Start describing or writing a function, include the whole file for context and it'll do its job. Give it a whole codebase and it will just wander in the woods burning tokens for ten minutes trying to solve dependencies.
shad42 18 hours ago [-]
LLMs are powerful assistants, as long as the user keeps a firm mental model of the problem. That's why, for now, they complement software engineers rather than replace them.
When you already know exactly what needs to be built and simply want to skip the drudgery of boilerplate or repetitive tasks, a coding CLI is great: it handles the grunt work so you can stay focused on the high-level design and decision-making that truly matter (and are also more fun).
trod1234 21 hours ago [-]
I think most people trying to touch on this topic don't consider this headline alongside other similar ones like "Why LLMs can't recognize themselves looping", "Why LLMs can't express intent", or "Why LLMs can't recognize truth/falsity, or confidence levels of what they know vs don't know". With a little thought, these basically equate to the Computer Science halting problem and the undecidable nature of mathematics.
Taken a step further, recognizing this makes the investment in such a moonshot pipedream (overcoming these inherent problems in a deterministic way) recklessly negligent.
revskill 21 hours ago [-]
They can read and analyze the error, then figure out the best way to resolve it. That is the best part about LLMs; no human can do it better than an LLM. But they are not your mind reader, and that is where things fall apart.
Nickersf 21 hours ago [-]
I think they're another tool in the toolbox not a new workshop. You have to build a good strategy around LLM usage when developing software. I think people are naturally noticing that and adapting.
antihipocrat 21 hours ago [-]
..."(at least for now) you are in the drivers seat, and the LLM is just another tool to reach for."
Improvements in model performance seem to be approaching the peak rather than demonstrating exponential gains. Is the quote above where we land in the end?
sneak 21 hours ago [-]
Am I the only one continuously astounded at how well Opus 4 actually does build mental models when prompted correctly?
I find Sonnet frequently loses the plot, but Opus can usually handle it (with sufficient clarity in prompting).
codr7 21 hours ago [-]
Well, welcome to the club of awareness :)
layer8 20 hours ago [-]
Awareness is all we need. ;)
robomartin 19 hours ago [-]
I decided to jump into the deep end of the pool and complete two projects using Cursor with its default AI setup.
The first project is a C++ embedded device. The second is a sophisticated Django-based UI front end for a hardware device (so, python interacting with hardware and various JS libraries handling most of the front end).
So far I am deeper into the Django project than the C++ embedded project.
It's interesting.
I had already hand-coded a conceptual version of the UI just to play with UI and interaction ideas. I handed this to Cursor as well as a very detailed specification for the entire project, including directory structure, libraries, where to use what and why, etc. In other words, exactly what I would provide a contractor or company if I were to outsource this project. I also told it to take a first stab at the front end based on the hand-coded version I plopped into a temporary project directory.
And then I channeled Jean-Luc Picard and said "Engage!".
The first iteration took a few minutes. It was surprisingly functional and complete. Yet, of course, it had problems. For example, it failed to separate various screens into separate independent Django apps. It failed to separate the one big beautiful CSS and JS files into independent app-specific CSS and JS files. In general, it ignored separation of concerns and just made it all work. This is the kind of thing you might expect from a junior programmer/fresh grad.
Achieving separation of concerns and eliminating other undesirable cross-pollination of code took some effort. LLMs don't really understand. They simulate understanding very well, but, at the end of the day, I don't think we are there. They tend to get stuck and make dumb mistakes.
The process to get to something that is now close to a release candidate entailed an interesting combination of manual editing and "molding" of the code base with short, precise and scope-limited instructions for Cursor. For my workflow I am finding that limiting what I ask AI to do delivers better results. Go too wide and it can be in a range between unpredictable and frustrating.
Speaking of frustrations, one of the most mind-numbing things it does every so often is also in a range, between completely destroying prior work or selectively eliminating or modifying functionality that used to work. This is why limiting the scope, for me, has been a much better path. If I tell it to do something in app A, there's a reasonable probability that it isn't going to mess with and damage the work done in app B.
This issue means that testing becomes far more important in this workflow, because, on every iteration, you have no idea what functionality may have been altered or damaged. It will also go nuts and do things you never asked it to do. For example, I was in the process of redoing the UI for one of the apps. For some reason it decided it was a good idea to change the UI for one of the other apps, remove all controls and replace them with controls it thought were appropriate or relevant (which wasn't even remotely the case). And, no, I did not ask it to touch anything other than the app we were working on.
Note: For those not familiar with Django, think of an app as a page with mostly self-contained functionality. Apps (pages) can share data with each other through various means, but, for the most part, the idea is that they are designed as independent units that can be plucked out of a project and plugged into another (in theory).
The other thing I've been doing is using ChatGPT and Cursor simultaneously. While Cursor is working I work with ChatGPT on the browser to plan the next steps, evaluate options (libraries, implementation, etc.) and even create quick stand-alone single file HTML tests I can run without having to plug into the Django project to test ideas. I like this very much. It works well for me. It allows me to explore ideas and options in the context of an OpenAI project and test things without the potential to confuse Cursor. I have been trying to limit Cursor to being a programmer, rather than having long exploratory conversations.
Based on this experience, one thing is very clear to me: If you don't know what you are doing, you are screwed. While the OpenAI demo where they have v5 develop a French language teaching app is cool and great, I cannot see people who don't know how to code producing anything that would be safe to bet the farm on. The code can be great and it can also be horrific. It can be well designed and it can be something that would cause you to fail your final exams in a software engineering course. There's great variability and you have to get your hands in there, understand and edit code by hand as part of the process.
Overall, I do like what I am seeing. Anyone who has done non-trivial projects in Django knows that there's a lot of busy boilerplate typing that is just a pain in the ass. With Cursor, that evaporates and you can focus on where the real value lies: The problem you are trying to solve.
I jump into the embedded C++ project next week. I've already done some of it, but I'm in that mental space 100% next week. Looking forward to new discoveries.
The other reality is simple: This is the worst this will ever be. And it is already pretty good.
ontigola 21 hours ago [-]
Great, concise article. Nothing important to add, except that AI snake-oil salesmen will continue spreading their exaggerations far and wide; at least those of us who are truly in this business agree on the facts.
jmclnx 21 hours ago [-]
I am not a fan of today's concept of "AI", but to be fair, building today's software is not for the faint of heart; very few people get it right on try 1.
Years ago I gave up compiling these large applications all together. I compiled Firefox via FreeBSD's (v8.x) ports system, that alone was a nightmare.
I cannot imagine what it would be like to compile GNOME3 or KDE or Libreoffice. Emacs is the largest thing I compile now.
anotherhue 21 hours ago [-]
I suggest trying Nix, by being reproducible those nasty compilation demons get solved once and for all. (And usually by someone else)
trod1234 21 hours ago [-]
The problem with Nix is that it's often claimed to be reproducible, but the proof isn't really there because of the existence of collisions. The definition of reproducible is taken in such an isolated context as to be almost absurd.
While a collision hasn't yet been found for a SHA256 package on Nix, by the pigeonhole principle collisions exist, and the computer will not be able to decide between the two packages in such a collision, leading to system-level failure with errors that have no link to their cause (due to the properties involved, and longstanding CS problems in computation).
These things, generally speaking, contain properties of mathematical chaos: a state that is inherently unknowable/unpredictable, which no admin would ever approach or touch because it's unmaintainable. The normally tightly coupled error-handling code is no longer tightly coupled, because it requires matching a determinable state (CS computation problems, halting/decidability).
Non-deterministic failure domains are the most costly problems to solve, because troubleshooting, which leverages properties of determinism, won't work.
This leaves you only with a strategy of guess and check, which requires intimate knowledge of the entire system stack without abstractions present.
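For scale, here is a hedged back-of-the-envelope estimate of the collision likelihood discussed above, using the standard birthday-bound approximation; the count n is an arbitrary illustration, not a real nixpkgs figure:

```python
# Rough birthday-bound estimate: probability of any SHA-256 collision among n
# random hashes is approximately n*(n-1) / 2^257. n here is arbitrary.
n = 10**9  # hypothetical number of hashed store paths
p = n * (n - 1) / 2**257
print(f"collision probability for n={n:,}: about {p:.1e}")
# Prints roughly 4.3e-60: collisions exist in principle (pigeonhole), but the
# chance of ever observing one by accident at this scale is negligible.
```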
anotherhue 20 hours ago [-]
Respectfully, you sound like AI. I expect you don't trust git either, especially as its hash is weaker.
A cursory look at a nix system would also show you that the package name, version and derivation sha are all concatenated together.
trod1234 20 hours ago [-]
Respectfully, I sound like a Computer Engineer because I've worked alongside quite a number of them, and the ones I've worked with had this opinion too.
> A cursory look at a nix system would show ... <three things concatenated together>
This doesn't negate or refute the pigeonhole principle. Following the pigeonhole principle, there is some likelihood that a collision will exist, and that probability trends to 1 given sufficient iterations (time).
The only argument you have is a measure of likelihood and probability, which is a streetlight-effect cognitive bias or intelligence trap. There's a video on YouTube which discusses these types of traps, a TED talk from an ex-CIA officer.
Likelihood and probability are heavily influenced by the priors they measure, and without perfect knowledge (which no one has today) those priors may deviate significantly, or be indeterminable.
Imagine for a second that a general method for rapidly predicting collisions, regardless of algorithm, is discovered and released, which may not be far off given current advances in quantum computing.
All the time and money cumulatively spent on Nix becomes wasted cost, and you are suddenly left in a position of complete compromise, without a sound pivot available at comparable cost.
With respect, if you can't differentiate basic a priori reasoned logic from AI, I would question your perceptual skills and whether they are degrading. There is a growing body of evidence that exposure to AI may cause such degradation, as various studies are starting to show with regard to doctors and their diagnostic skills after AI use (1).
On the contrary, Kiro (https://kiro.dev) is showing that it can be done by breaking down software engineering into multiple stages (requirements, design, and tasks) and then breaking the tasks down into discrete subtasks. Each of those can then be customized and refined as much as you like. It will even sketch out initial documents for all three.
It’s still early days, but we are learning that as with software written exclusively by humans, the more specific the specifications are, the more likely the result will be as you intended.
quantumHazer 20 hours ago [-]
One minute of research on the internet led me to discover that you are a MARKETING MANAGER at Amazon, so your take is full of conflict of interest and this should be disclosed.
dmacfour 19 hours ago [-]
There's an absurd amount of astroturfing in discussions about AI. Especially on Reddit.
otterley 18 hours ago [-]
Fair enough and I apologize for not disclosing it. However, Kiro is not a service in scope for me, and this is my own opinion, not that of the company.
And it’s not a conflict of interest. I’m free to criticize my company if I like.
mccoyb 21 hours ago [-]
This is a low-information-density blog post. I've really liked Zed's blog posts in the past (especially about the editor internals!) so I hope this doesn't come across the wrong way, but this seems to be a loose restatement of what many people are empirically finding out by using LLM agents.
Perhaps good for someone just getting their feet wet with these computational objects, but not resolving or explaining things in a clear way, or highlighting trends in research and engineering that might point towards ways forward.
You also have a technical writing no-no where you cite a rather precise and specific study with a paraphrase to support your claims … analogous to saying "Gödel's incompleteness theorem means _something something_ about the nature of consciousness".
A phrase like: “Unfortunately, for now, they cannot (beyond a certain complexity) actually understand what is going on” referencing a precise study … is ambiguous and shoddy technical writing — what exactly does the author mean here? It’s vague.
I think it is even worse here because _the original study_ provides task-specific notions of complexity (a critique of the original study! Won’t different representations lead to different complexity scaling behavior? Of course! That’s what software engineering is all about: I need to think at different levels to control my exposure to complexity)
[1]: https://grugbrain.dev
The fact that I wouldn't trust any LLM to touch any of my code in those real world cases makes me think that most people who are touting them are not, in fact, writing code at the same level or doing the same job I do. Or understand it very well.
I like to explain my work as "do whatever is needed to do as little work as possible".
Be it by improving logs, improving architecture, updating logs, pushing responsibilities around, or rejecting some features.
More significantly though, OP seems right on to me. The basic functionality of LLMs is handy for a code writing assistant, but does not replace a software engineer, and is not ever likely to, no matter how many janky accessories we bolt on. LLMs are fundamentally semantic pattern matching engines, and are only problem solvers in the context of problems that are either explicitly or implicitly defined and solved in their training data. They will always require supervision because there is fundamentally no difference between a useful LLM output and a “hallucination” except the utility rating that a human judge applies to the output.
LLMs are good at solving fully defined, fully solved problems. A lot of work falls into that category, but some does not.
So right now an LLM and the developer you describe here are two very different things, and an LLM will, by design, never replace you.
And to have boundless contextual awareness… dig a rabbit hole, but beware that you are in your own hole. At this point you can escape the hole but you have to be purposefully aware of what guardrails and ladders you give the agent to evoke action.
The better and more explicit the guardrails you provide, the more likely the agent is to do what is expected and honor the scope and context you establish. If you tell it to use silverware to eat, don't assume it will use it appropriately or idiomatically; it will try eating soup with a fork.
Lastly don’t be afraid of commits and checkpoints, or to reject/rollback proposed changes and restate or reset the context. The agent might be the leading actor, but you are the director. When a scene doesn’t play out, try it again after clarification or changing camera perspective or lighting or lines, or cut/replace the scene entirely.
> The fact that I wouldn't trust any LLM to touch any of my code in those real world cases makes me think that most people who are touting them are not, in fact, writing code at the same level or doing the same job I do. Or understand it very well.
I agree with this specifically for agentic LLM use. However, I've personally increased my code speed and quality with LLMs for sure using purely local models as a really fancy auto complete for 1 or 2 lines at a time.
The rest of your comment is good, but the last paragraph to me reads like someone inexperienced with LLMs looking to find excuses to justify not being productive with them, when others clearly are. Sorry.
This sentiment, that a human will always be needed, that there's no replacement for the human touch, that the stakes are too high, is as old as time.
You just said, quite literally, that people leveraging LLMs to code are not doing it at your level - that borders on hubris.
The fact of the matter is that like most tools, you get out of AI what you put into it
I know a lot of engineers and this pride, this reluctance to accept the help is super common
The best engineers on the other hand are leveraging this just fine, just another tool for them that speeds things up
We’re living it. We see it every day. The business leaders cannot be convinced that this isn’t making less skilled developers more productive.
Those rules are also very fuzzy and only get defined more formally by the coding process.
But that's just plain wrong and a proper developer would be allowed to change that. If you're not authenticating properly, you get a 401. That means you can't prove you're who you say you are.
If you are past that, i.e. we know that you are who you say you are, then the proper return code is 403 for saying "You are not allowed to access what you're trying to access, given who you are".
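A minimal sketch of that 401-vs-403 distinction in plain Python; the token table, the User class, and the "admin" role check are hypothetical stand-ins for real authentication:

```python
# Minimal sketch of 401 vs 403; TOKENS and the "admin" role are hypothetical.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class User:
    name: str
    role: str

TOKENS = {"tok-alice": User("alice", "admin"), "tok-bob": User("bob", "viewer")}

def delete_report(auth_token: Optional[str]) -> Tuple[int, str]:
    user = TOKENS.get(auth_token or "")
    if user is None:
        # We can't tell who you are: 401 Unauthorized.
        return 401, "authentication required"
    if user.role != "admin":
        # We know who you are, but you lack the privilege: 403 Forbidden.
        return 403, "insufficient privileges"
    return 200, "report deleted"

print(delete_report(None))         # (401, 'authentication required')
print(delete_report("tok-bob"))    # (403, 'insufficient privileges')
print(delete_report("tok-alice"))  # (200, 'report deleted')
```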
Which funnily enough seems to be a very elusive concept to many humans as well, never mind an LLM.
It really boils down to what scenario you have in mind. Developers do interact with product managers and discussions do involve information flowing both ways. Even if a PM ultimately decides what the product should do, you as a developer have a say in the process and outcome.
Also, there are always technological constraints, and sometimes even practical constraints are critical. A PM might want to push this or that feature, but if it's impossible to deliver by a specific deadline they have no alternative but to compromise, and the compromise is determined by what developers call out.
There's plenty of that work, and it goes by many names ("enterprise", others).
But lots and lots and lots of programmers are concerned with using computers for computations: making things with the new hardware that you couldn't with the old hardware, to give one example. Embedded, cryptography, graphics, simulation, ML, drones and compilers and all kinds of stuff are much more about resources than business logic.
You can define business logic broadly enough to cover anything, I guess, but at some point it's no longer what you meant by the term.
I went and got an MBA to try and get around this. It didn't work.
When a non-developer writes code with an LLM, the technical quality of the code is lower than a developer's. But at the same time, it gets a boost from more "business context."
In a year or two, I imagine that a non-developer with a proper LLM may surpass a vanilla developer.
They usually code for the happy path, and add edge cases as bugs are discovered in production. But after a while both happy path and edge cases blend into a ball of mud that you need the correct incantation to get running. And it's a logic maze that contradicts every piece of documentation you can find (tickets, emails). Then it quickly becomes something that people don't dare to touch.
When the employer business isn't shipping software, engineers have no other option than actually learn the business as well.
Agree strongly, and I think this is basically what the article is saying as well about keeping a mental model of requirements/code behavior. We kind of already knew this was the hard part. How many times have you heard that once you get past junior level, the hard part is not writing the code? And that it's knowing what code to write? This realization is practically a rite of passage.
Which kind of raises the question of what the software engineering job looks like in the future. It definitely depends on how good the AI is. In the most simplistic case, AI can do all the coding right now and all you need is a task issue. And frankly, probably a user-written (or at least reviewed, but probably written) test. You could make the issue and test upfront, farm out the PR to an agent, and manually approve when you see it has passed the test case you wrote (a minimal sketch of such a test follows after this comment).
In that case you are basically PM and QA. You are not even forming the prompt, just detailing the requirements.
But as the tech improves, can all tasks fit into that model? Not design/architecture tasks - or at least not without a different task-completion model than described above. The window will probably grow, but it's hard to imagine that it will handle all pure coding tasks. Even for large tasks that theoretically can fit into that model, you are going to have to do a lot of thinking and testing and prototyping to figure out the requirements and test cases. In theory you could apply the same task/test process, but that seems like it would be too much structure and indirection to actually be helpful compared to knowing how to code.
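As a rough illustration of the issue-plus-test workflow described above, here is the kind of acceptance test the human might write up front (pytest-style); the apply_discount function and its discount rule are made up for the example, and the agent would be the one iterating on the implementation until the test passes:

```python
# Hypothetical illustration: the human writes the acceptance test up front and
# the agent iterates on apply_discount until it passes. All names and the
# discount rule are invented for the example.

def apply_discount(unit_price_cents: int, quantity: int) -> int:
    """Reference behavior the test pins down: 5% off at 10 or more items."""
    total = unit_price_cents * quantity
    return total * 95 // 100 if quantity >= 10 else total

def test_bulk_discount_applies_only_at_ten_or_more_items():
    assert apply_discount(unit_price_cents=1000, quantity=9) == 9000   # below threshold
    assert apply_discount(unit_price_cents=1000, quantity=10) == 9500  # 5% off at 10+

if __name__ == "__main__":
    test_bulk_discount_applies_only_at_ten_or_more_items()
    print("acceptance test passed")
```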
I agree with the PM role, but with such low requirements that anyone can do it.
The good ones wear multiple hats: they actually define the problem, learn enough about a domain to interact with it or with the experts in said domain, and figure out the short- vs. long-term tradeoffs so they can focus on the value and not just the technical aspect.
An earlier effort at AI was based on rules and the C. Forgy RETE algorithm. Soooo, rules have been tried??
Rules engines were traditionally written in Prolog or Lisp during the AI wave when they were cool.
Forgy was Charles Forgy.
For a "rules engine", there was also IBM's YES/L1.
But software architects (especially of various reusable frameworks) have to maintain the right set of abstractions and make sure the system is correct and fast, easy to debug, that developers fall into the pit of success etc.
Here are just a few major ones, each of which would be a chapter in a book I would write about software engineering:
ENVIRONMENTS & WORKFLOWS

Environment Setup
* Set up a local IDE with a full clone of the app (frontend, backend, DB).
* Use .env or similar to manage config/secrets; never commit them.
* Debuggers and breakpoints are more scalable than console.log. Prefer conditional or version-controlled breakpoints in feature branches.

Test & Deployment Environments
* Maintain at least 3 environments: Local (dev), Staging (integration test), Live (production).
* Make state cloning easy (e.g., DB snapshots or test fixtures).
* Use feature flags to isolate experimental code from production.

BUGS & REGRESSIONS

Bug Hygiene
* Version control everything except secrets.
* Use linting and commit hooks to enforce code quality.
* A bug isn’t fixed unless it’s reliably reproducible.
* Encourage bug reporters to reset to clean state and provide clear steps.

Fix in Context
* Keep branches showing the bug, even if it vanishes upstream.
* Always fix bugs in the original context to avoid masking root causes.

EFFICIENCY & SCALE

Lazy & On-Demand
* Lazy-load data/assets unless profiling suggests otherwise.
* Use layered caching: session, view, DB level. Always bound cache size to avoid memory leaks.
* Pre-generate static pages where possible—static sites are high-efficiency caches.

Avoid I/O
* Use local computation (e.g., HMAC-signed tokens) over DB hits (see the sketch after this outline).
* Encode routing/logic decisions into sessionId/clientId when feasible.

Partitioning & Scaling
* Shard your data; that’s often the bottleneck.
* Centralize the source of truth; replicate locally.
* Use multimaster sync (vector clocks, CRDTs) only when essential.
* Aim for O(log N) operations; allow O(N) preprocessing if needed.

CODEBASE DESIGN

Pragmatic Abstraction
* Use simple, obvious algorithms first—optimize when proven necessary.
* Producer-side optimization compounds through reuse.
* Apply the 80/20 rule: optimize for the common case, not the edge.

Async & Modular
* Default to async for side-effectful functions, even if not awaited (in JS).
* Namespace modules to avoid globals.
* Autoload code paths on demand to reduce initial complexity.

Hooks & Extensibility
* Use layered architecture: Transport → Controller → Model → Adapter.
* Add hookable events for observability and customization.
* Wrap external I/O with middleware/adapters to isolate failures.

SECURITY & INTEGRITY

Input Validation & Escaping
* Validate all untrusted input at the boundary.
* Sanitize input and escape output to prevent XSS, SQLi, etc.
* Apply defense-in-depth: validate client-side, then re-validate server-side.

Session & Token Security
* Use HMACs or signatures to validate tokens without needing DB access.
* Enable secure edge-based filtering (e.g., CDN rules based on token claims).

Tamper Resistance
* Use content-addressable storage to detect object integrity.
* Append-only logs support auditability and sync.

INTERNATIONALIZATION & ACCESSIBILITY

I18n & L10n
* Externalize all user-visible strings.
* Use structured translation systems with context-aware keys.
* Design for RTL (right-to-left) languages and varying plural forms.

Accessibility (A11y)
* Use semantic HTML and ARIA roles where needed.
* Support keyboard navigation and screen readers.
* Ensure color contrast and readable fonts in UI design.

GENERAL ENGINEERING PRINCIPLES

Idempotency & Replay
* Handlers should be idempotent where possible.
* Design for repeatable operations and safe retries.
* Append-only logs and hashes help with replay and audit.

Developer Experience (DX)
* Provide trace logs, debug UIs, and metrics.
* Make it easy to fork, override, and simulate environments.
* Build composable, testable components.

ADDITIONAL TOPICS WORTH COVERING

Logging & Observability
* Use structured logging (JSON, key-value) for easy analysis.
* Tag logs with request/session IDs.
* Separate logs by severity (debug/info/warn/error/fatal).

Configuration Management
* Use environment variables for config, not hardcoded values.
* Support override layers (defaults → env vars → CLI → runtime).
* Ensure configuration is reloadable without restarting services if possible.

Continuous Integration / Delivery
* Automate tests and checks before merging.
* Use canary releases and feature flags for safe rollouts.
* Keep pipelines fast to reduce friction.
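As a hedged sketch of the HMAC-signed-token idea in the outline above (validating a session with local computation instead of a DB hit): the claim names, key source, and encoding below are illustrative assumptions, not a hardened implementation.

```python
# Minimal sketch: stateless HMAC-signed tokens validated without a DB lookup.
# Claim names, key source, and encoding are illustrative only.
import base64, hashlib, hmac, json, os, time
from typing import Optional

SECRET = os.environ.get("SESSION_SIGNING_KEY", "dev-only-key").encode()

def sign_token(claims: dict) -> str:
    payload = base64.urlsafe_b64encode(json.dumps(claims).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return (payload + b"." + base64.urlsafe_b64encode(sig)).decode()

def verify_token(token: str) -> Optional[dict]:
    try:
        payload_b64, sig_b64 = token.encode().split(b".", 1)
        expected = hmac.new(SECRET, payload_b64, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, base64.urlsafe_b64decode(sig_b64)):
            return None  # signature mismatch: token was forged or tampered with
        claims = json.loads(base64.urlsafe_b64decode(payload_b64))
        if claims.get("exp", 0) < time.time():
            return None  # expired
        return claims
    except Exception:
        return None  # malformed token

token = sign_token({"user_id": 42, "role": "viewer", "exp": time.time() + 3600})
print(verify_token(token))  # claims recovered and verified with no DB hit
```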
You should probably go do that, rather than using the comment section of HN as a scratch pad of your stream of consciousness. That's not useful to anyone other than yourself.
Is this a copypasta you just have laying around?
If irony was a ton of bricks, you'd be dead
Not really. It goes off on a tangent, and frankly I stopped reading the wall of text because it adds nothing of value.
If you write a wall of text where the first pages are inane drivel, what do you think are the odds that the rest of that wall of text suddenly adds readable gems?
Sometimes a turd is just a turd, and you don't need to analyze all of it to know the best thing to do is to flush it.
It really isn't. There is no point in pretending it is, and even less of a point in expecting anyone to waste their time on an unreadable and incoherent wall of text.
You decide how you waste your time, and so does everyone else.
1. Set up a local IDE with a full clone of the app (frontend, backend, DB).
Thus the app must be fully able to run on a small, local environment, which is true of open source apps but not always for for-profit companies
2. Use .env or similar to manage config/secrets; never commit them.
A lot of people don’t properly exclude secrets from version control, leading to catastrophic secret leaks. Also, when everyone has their own copy, the developer secrets and credentials aren’t that important (see the config-loading sketch after this list).
3. Debuggers and breakpoints are more scalable than console.log. Prefer conditional or version-controlled breakpoints in feature branches.
A lot of people don’t use debuggers and breakpoints, instead doing logging. Also they have no idea how to maintain DIFFERENT sets of breakpoints, which you can do by checking the project files into version control, and varying them by branches.
4. Test & Deployment Environments Maintain at least 3 environments: Local (dev), Staging (integration test), Live (production).
This is fairly standard advice, but it is best practice, so people can test in local and staging.
5. Make state cloning easy (e.g., DB snapshots or test fixtures).
This is not trivial. For example, downloading a local copy of a test database to test your local copy of Facebook with a production-style database. Make it fast, e.g. by rsyncing MySQL InnoDB files.
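A minimal sketch of the config/secret handling in point 2, assuming the python-dotenv package is installed; the variable names are made up, and the layering (defaults, then .env, then real environment variables) mirrors the advice above:

```python
# Minimal sketch of layered config: defaults -> .env (never committed) -> real env vars.
# Assumes the python-dotenv package; DATABASE_URL / SECRET_KEY are example names.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env if present; real environment variables still win

DATABASE_URL = os.environ.get("DATABASE_URL", "postgresql://localhost:5432/app_dev")
SECRET_KEY = os.environ.get("SECRET_KEY")

if SECRET_KEY is None:
    # Fail fast instead of shipping with a hardcoded fallback secret.
    raise RuntimeError("SECRET_KEY must be set in the environment or in .env")

if __name__ == "__main__":
    print("connecting to", DATABASE_URL)
```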
If this is how you think LLMs and Coding Agents are going about writing code, you haven't been using the right tools. Things happen, sure, but also mostly don't. Nobody is arguing that LLM-written code should be pushed directly into production, or that they'll solve every task.
LLMs are tools, and everyone eventually figures out a process that works best for them. For me, it was strong specs/docs, strict types, and lots of tests. And then of course the reviews if it's serious work.
LLMs are really good at template tasks: writing tests, boilerplate, etc. But most times I'm not doing "implement this button"; I'm doing "there's a logic mismatch with my expectation".
And the moment the context is compacted, it forgets this instruction “fix the problems, don’t delete the file,” and tries to delete it again. I need to watch it like a hawk.
What an LLM cannot do today is almost irrelevant in the tide of change upon the industry. The fact is, with improvements, it doesn't mean an LLM cannot do it tomorrow.
LLMs are not like this. The fundamental way they operate, the core of their design is faulty. They don't understand rules or knowledge. They can't, despite marketing, really reason. They can't learn with each interaction. They don't understand what they write.
All they do is spit out the most likely text to follow some other text based on probability. For casual discussion about well-written topics, that's more than good enough. But for unique problems in a non-English language, it struggles. It always will. It doesn't matter how big you make the model.
They're great for writing boilerplate that has been written a million times with different variations - which can save programmers a LOT of time. The moment you hand them anything more complex it's asking for disaster.
> Prove to me that human thought is not predicting the most probable next token.
Explain the concept of color to a completely blind person. If their brain does nothing but process tokens this should be easy.
> How can you tell a human actually understands?
What a strange question coming from a human. I would say if you are a human with a consciousness you are able to answer this for yourself, and if you aren't no answer will help.
Modern coding AI models are not just probability crunching transformers. They haven't been just that for some time. In current coding models the transformer bit is just one part of what is really an expert system. The complete package includes things like highly curated training data, specialized tokenizers, pre- and post-training regimens, guardrails, optimized system prompts, etc., all tuned to coding. Put it all together and you get one-shot performance on generating the type of code that was unthinkable even a year ago.
The point is that the entire expert system is getting better at a rapid pace and the probability bit is just one part of it. The complexity frontier for code generation keeps moving and there's still a lot of low hanging fruit to be had in pushing it forward.
> They're great for writing boilerplate that has been written a million times with different variations
That's >90% of all code in the wild. Probably more. We have three quarters of a century of code in our history, so there is very little that's original anymore. Maybe original to the human coder fresh out of school, but the models have all this history to draw upon. So if the models produce the boilerplate reliably then human toil in writing if/then statements is at an end. Kind of like - barring the occasional mad genius [0] - the vast majority of coders don't write assembly to create a website anymore.
[0] https://asm32.info/index.cgi?page=content/0_MiniMagAsm/index...
It seems you were not aware that you ended up describing probabilistic coding transformers. Each and every single one of those details is nothing more than a strategy to apply constraints to the probability distributions used by the probability-crunching transformers. I mean, read what you wrote: what do you think "curated training data" means?
> Put it all together and you get one shot performance on generating the type of code that was unthinkable even a year ago.
This bit here says absolutely nothing.
This is lipstick on a pig. All those methods are impressive, but ultimately workarounds for an idea that is fundamentally unsuitable for programming.
>That's >90% of all code in the wild. Probably more.
Maybe, but not 90% of time spent on programming. Boilerplate is easy. It's the 20%/80% rule in action.
I don't deny these tools can be useful and save time - but they can't be left to their own devices. They need to be tightly controlled and given narrow scopes, with heavy oversight by an SME who knows what the code is supposed to be doing. "Design W module with X interface designed to do Y in Z way", keeping it as small as possible and reviewing it to hell and back. And keeping it accountable by making tests yourself. Never let it test itself, it simply cannot be trusted to do so.
LLMs are incredibly good at writing something that looks reasonable, but is complete nonsense. That's horrible from a code maintenance perspective.
https://xkcd.com/1205/
After a while, it just makes sense to redesign the boilerplate and build some abstraction instead. Duplicated logic and data are hard to change and fix. The frustration is a clear signal to take a step back and take a holistic view of the system.
And even with all that, they still produce garbage way too often. If we continue the "car" analogy, the car would crash randomly sometimes when you leave the driveway, and sometimes it would just drive into the house. So you add all kinds of fancy bumpers to the car and guard rails to the roads, and the car still runs off the road way too often.
Not to disagree, but "non-english" isn't exactly relevant. For unique problems, LLMs can still manage to output hallucinations that end up being right or useful. For example, LLMs can predict what an API looks like and how it works even if they do not have the API in context if the API was designed following standard design principles and best practices. LLMs can also build up context while you interact with them, which means that iteratively prompting them that X works while Y doesn't will help them build the necessary and sufficient context to output accurate responses.
This is the first word that came to mind when reading the comment above yours. Like:
>They can't, despite marketing, really reason
They aren't, despite marketing, really hallucinations.
Now I understand why these companies don't want to market using terms like "extrapolated bullshit", but I don't understand how there is any technological solution to it without starting from a fresh base.
They are hallucinations. You might not be aware of what that concept means in terms of LLMs, but just because you are oblivious to the definition of a concept does not mean it doesn't exist.
You can learn about the concept by spending a couple of minutes reading this article on Wikipedia.
https://en.wikipedia.org/wiki/Hallucination_(artificial_inte...
> Now I understand why these companies don't want to market using terms like "extrapolated bullshit", (...)
That's literally in the definition. Please do yourself a favour and get acquainted with the topic before posting comments.
>(also called bullshitting,[1][2] confabulation,[3] or delusion)[4]
Here's the first linked source:
https://www.psypost.org/scholars-ai-isnt-hallucinating-its-b...
Irrelevant. Wikipedia does not create concepts. Again, if you take a few minutes to learn about the topic you will eventually understand the concept was coined a couple of decades ago, and has a specific meaning.
Either you opt to learn, or you don't. Your choice.
> Here's the first linked source:
Irrelevant. Your argument is as pointless and silly as claiming rubber duck debugging doesn't exist because no rubber duck is involved.
how so? programs might use english words but are decidedly not english.
I pointed out the fact that the concept of a language doesn't exist in token predictors. They are trained with a corpus, and LLMs generate outputs that reflect how the input is mapped in accordance with how they were trained on said corpus. Natural language makes the problem harder, but not being English is only relevant in terms of what corpus was used to train them.
Religious fervor in one's own opinion on the state of the world seems to be the zeitgeist.
Said like a true software person. I'm to understand that computer people are looking at LLMs from the wrong end of the telescope; and that from a neuroscience perspective, there's a growing consensus among neuroscientists that the brain is fundamentally a token predictor, and that it works on exactly the same principles as LLMs. The only difference between a brain and an LLM may be the size of its memory, and what kind and quality of data it's trained on.
Hahahahahaha.
Oh god, you're serious.
Sure, let's just completely ignore all the other types of processing that the brain does. Sensory input processing, emotional regulation, social behavior, spatial reasoning, long and short term planning, the complex communication and feedback between every part of the body - even down to the gut microbiome.
The brain (human or otherwise) is incredibly complex and we've barely scraped the surface of how it works. It's not just nuerons (which are themselves complex), it's interactions between thousands of types of cells performing multiple functions each. It will likely be hundreds of years before we get a full grasp on how it truly works - if we ever do at all.
This is trivially proven false, because LLMs have far larger memory than your average human brain and are trained on far more data. Yet they do not come even close to approximating human cognition.
I feel like we're underestimating how much data we as humans are exposed to. There's a reason AI struggles to generate an image of a full glass of wine. It has no concept of what wine is. It probably knows way more theory about it than any human, but it's missing the physical.
In order to train AIs the way we train ourselves, we'll need to give it more senses, and I'm no data scientist but that's presumably an inordinate amount of data. Training AI to feel, smell, see in 3D, etc is probably going to cost exponentially more than what the AI companies make now or ever will. But that is the only way to make AI understand rather than know.
We often like to state how much more capacity for knowledge AI has than the average human, but in reality we are just underestimating ourselves as humans.
Tokens are a highly specific, transformer-exclusive concept. The human brain doesn't run a byte pair encoding (BPE) tokenizer [0] in its head, nor does it represent anything as tokens. It uses asynchronous, time-varying, spiking analog signals. Humans are the inventors of human languages and are not bound to any static token encoding scheme, so this view of what humans do as "token prediction" requires a gross misrepresentation of either what a token is or what humans do.
If I had to argue that humans are similar to anything in machine learning research specifically, I would have to argue that they extremely loosely follow the following principles:
* reinforcement learning with the non-brain parts defining the reward function (primarily hormones and pain receptors)
* an extremely complicated non-linear kalman filter that not only estimates the current state of the human body, but also "estimates" the parameters of a sensor fusing model
* there is a necessary projection of the sensor fused result that then serves as available data/input to the reinforcement learning part of the brain
Now here are two big reasons why the model I describe is a better fit:
The first reason is that I am extremely loose and vague. By playing word games I have weaseled myself out of any specific technology and am on the level of concepts.
The second reason is that the kalman filter concept here is general enough that it also includes predictor models, but the predictor model here is not the output that drives human action, because that would logically require the dataset to already contain human actions, which is what you did, you assume that all learning is imitation learning.
In my model, any internal predictor model that is part of the kalman filter is used to collect data, not drive human action. Actions like eating or drinking are instead driven by the state of the human body, e.g. hunger is controlled through leptin and insulin and others. All forms of work, no matter how much of a detour it represents, ultimately has the goal of feeding yourself or your family (=reproduction).
[0] A BPE tokenizer is a piece of human-written software that was given a dataset to generate an efficient encoding scheme, and the idea itself is completely independent of machine learning and neural networks. The fundamental idea behind BPE is that you generate a static compression dictionary and never change it.
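For readers unfamiliar with the footnote, a toy sketch of how a BPE merge table is learned once from a corpus and then frozen; the corpus and merge count here are arbitrary:

```python
# Toy BPE: learn a static merge table from a corpus, then never change it.
from collections import Counter

def learn_bpe_merges(corpus: str, num_merges: int):
    vocab = Counter(tuple(word) for word in corpus.split())  # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():          # count adjacent symbol pairs
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent pair gets merged
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():          # apply the merge everywhere
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges  # the frozen "compression dictionary" applied at inference time

print(learn_bpe_merges("low lower lowest low low", 3))
```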
We can reasonably speak about certain fundamental limitations of LLMs without those being claims about what AI may ever do.
I would agree they fundamentally lack models of the current task and that it is not very likely that continually growing the context will solve that problem, since it hasn't already. That doesn't mean there won't someday be an AI that has a model much as we humans do. But I'm fairly confident it won't be an LLM. It may have an LLM as a component but the AI component won't be primarily an LLM. It'll be something else.
Neural networks are necessary but not sufficient. LLMs are necessary but not sufficient.
I have no doubt that there are multiple (perhaps thousands? more?) of LLM-like subsystems in our brains. They appear to be a necessary part of creating useful intelligence. My pet theory is that LLMs are used for associative memory purposes. They help generate new ideas and make predictions. They extract information buried in other memory. Clearly there is another system on top that tests, refines, and organizes the output. And probably does many more things we haven't even thought to name yet.
Alternatively, the goalposts keep being moved.
1. People are trying to sell a product that is not ready and thus are overhyping it
2. The tech is in its early days and may evolve into something useful via refinement and not necessarily by some radical paradigm shift
In order for (2) to happen it helps if the field is well motivated and funded (1)
The premise that an AI needs to do Y "as we do" to be good at X because humans use Y to be good at X needs closer examination. This presumption seems to be omnipresent in these conversations and I find it so strange. Alpha Zero doesn't model chess "the way we do".
> The premise that an AI needs to do Y "as we do" to be good at X because humans use Y to be good at X needs closer examination.
I don't see it being used as a premise. I see it as speculation that is trying to understand why this type of AI underperforms at certain types of tasks. Y may not be necessary to do X well, but if a system is doing X poorly and the difference between that system and another system seems to be Y, it's worth exploring if adding Y would improve the performance.
The sooner people stop worrying about a label for what you feel fits LLMs best, the sooner they can find the things they (LLMs) absolutely excel at and improve their (the user's) workflows.
Stop fighting the future. It's not replacing anyone right now. Later? Maybe. But right now the developers and users fully embracing it are experiencing productivity boosts unseen previously.
Language is what people use it as.
Unfortunately, discourse has followed an epistemic trajectory influenced by Hollywood and science fiction, making clear communication on the subject nearly impossible without substantial misunderstanding.
This is the kind of thing that I disagree with. Over the last 75 years we’ve seen enormous productivity gains.
You think that LLMs are a bigger productivity boost than moving from physically rewiring computers to using punch cards, from running programs as batch processes with printed output to getting immediate output, from programming in assembly to higher level languages, or even just moving from enterprise Java to Rails?
Skepticism isn't the same thing as fighting the future.
I will call something AGI when it can reliably solve novel problems it hasn't been pre-trained on. That's my goal post and I haven't moved it.
EDIT - I see now. sorry.
For all intents and purposes of the public. AI == LLM. End of story. Doesn't matter what developers say.
This is interesting, because it's so clearly wrong. The developers are also the people who develop the LLMs, so obviously what they say is actually the factual matter of the situation. It absolutely does matter what they say.
But the public perception is that AI == LLM, agreed. Until it changes and the next development comes along, when suddenly public perception will change and LLMs will be old news, obviously not AI, and the new shiny will be AI. So not End of Story.
People are morons. Individuals are smart, intelligent, funny, interesting, etc. But in groups we're moronic.
Almost always, yes, because I know what I'm doing and I have a brain that can think. I actually think before I do anything, which leads to good results. Don't assume everyone is a junior.
>Didn't think so.
You don't know me at all.
Sure sometimes I do stuff I am not confident about to learn but then I don't say "here I solved the problem for you" without building confidence around the solution first.
Every competent senior engineer should be like this; if you aren't, then you aren't competent. If you are confident in a solution then it should almost always work, else you are overconfident and thus not competent. LLMs are confident in solutions that are shit.
If you always use your first output then you are not a senior engineer: either your problem space is simple enough that you can fit all the context in your head on the first try, or quite frankly you just bodge things together in a non-optimal way.
It always takes some tries at a problem to grasp edge cases and to easier visualize the problem space.
This is not a fault of the users. These labels are pushed primarily by "AI" companies in order to hype their products to be far more capable than they are, which in turn increases their financial valuation. Starting with "AI" itself, "superintelligence", "reasoning", "chain of thought", "mixture of experts", and a bunch of other labels that anthropomorphize and aggrandize their products. This is a grifting tactic old as time itself.
From Sam Altman[1]:
> We are past the event horizon; the takeoff has started. Humanity is close to building digital superintelligence
Apologists will say "they're just words that best describe these products", repeat Dijkstra's "submarines don't swim" quote, but all of this is missing the point. These words are used deliberately because of their association to human concepts, when in reality the way the products work is not even close to what those words mean. In fact, the fuzzier the word's definition ("intelligence", "reasoning", "thought"), the more valuable it is, since it makes the product sound mysterious and magical, and makes it easier to shake off critics. This is an absolutely insidious marketing tactic.
The sooner companies start promoting their products honestly, the sooner their products will actually benefit humanity. Until then, we'll keep drowning in disinformation, and reaping the consequences of an unregulated marketplace of grifters.
[1]: https://blog.samaltman.com/the-gentle-singularity
I have the complete opposite feeling. The layman understanding of the term "AI" is AGI, a term that only needs to exist because researchers and businessmen hype their latest creations as AI.
The goalposts for AI don't move, but the definition isn't precise; we know it when we see it.
AI, to the layman, is Skynet/Terminator, Asimov's robots, Data, etc.
The goalposts moving that you're seeing is when something the tech bubble calls AI escapes the tech bubble and everyone else looks at it and says, no, that's not AI.
The problem is that everything that comes out of the research efforts toward AI, the tech industry calls AI despite it not achieving that goal by the common understanding of the term. LLMs were/are a hopeful AI candidate but, as of today, they aren't but that doesn't stop OpenAI from trying to raise money using the term.
If you want some semantic rigour use more specific terms like AGI, human equivalent AGI, super human AGI, exponentially self improving AGI, etc. Even those labels lack rigour, but at least they are less ambiguous.
LLMs are pretty clearly AI and AGI under commonly understood, lay definitions. LLMs are not human level AGI and perhaps will never be by themselves.
LLMs may get better, but it will not be what people are clamoring them to be.
Maybe they should have; a lot of the engineering techniques and methodologies that produced the assembly line and the mass-produced vehicle also led the way into space exploration.
* are many times the size of the occupants, greatly constricting throughput.
* are many times heavier than humans, requiring vastly more energy to move.
* travel at speeds and weights that are danger to humans, thus requiring strictly segregated spaces.
* are only used less than 5% of the day, requiring places to store them when unused.
* require extremely wide turning radii when traveling at speed (there’s a viral photo showing the entire historical city of Florence fitting inside a single US cloverleaf interchange)
Not only have none of these flaws been fixed, many of them have gotten worse with advancing technology because they’re baked into the nature of cars.
Anyone at the invention of the automobile with sufficient foresight could have seen the havoc these intersecting incentives would wreak, just as many of the future impacts of LLMs are foreseeable today, independent of technical progress.
It can also learn new things using trial and error with MCP tools. Once it has figured out some problem, you can ask it to summarize the insights for later use.
What would you define as an AI mental model?
To me as a layman, this feels like a clear explanation of how these tools break down, why they start going in circles when you reach a certain complexity, why they make a mess of unusual requirements, and why they have such an incredibly nuanced grasp of complex ideas that are widely publicized while being unable to draw basic conclusions about specific constraints in your project.
Dismissing a concern with “LLMs/AI can’t do it today but they will probably be able to do it tomorrow” isn’t all that useful or helpful when “tomorrow” in this context could just as easily be “two months from now” or “50 years from now”.
I mean, there was and then there wasn't. All of those things are shrinking fast because we handed over control to people who care more about profits than customers because we got too comfy and too cheap, and now right to repair is screwed.
Honestly, I see LLM-driven development as a threat to open source and right to repair, among a litany of other things.
A crucial ingredient might be missing.
"Every critique of AI assumes to some degree that contemporary implementations will not, or cannot, be improved upon.
Lemma: any statement about AI which uses the word "never" to preclude some feature from future realization is false.
Lemma: contemporary implementations have almost always already been improved upon, but are unevenly distributed."
And with fusion, we already have a working prototype (the Sun). And if we could just scale our tech up enough, maybe we’d have usable fusion.
(Sometimes that sort of criticism is spot on. If someone says they've got a brilliant new design for a perpetual motion machine, go ahead and tell them it'll never work. But in the general case it's overconfident.)
That is too reductive and simply not true. Contemporary critiques of AI include that they waste precious resources (such as water and energy) and accelerate bad environmental and societal outcomes (such as climate change, the spread of misinformation, loss of expertise), among others. Critiques go far beyond “hur dur, LLM can’t code good”, and those problems are both serious and urgent. Keep sweeping critiques under the rug because “they’ll be solved in the next five years” (eternally away) and it may be too late. Critiques have to take into account the now and the very real repercussions already happening.
But I'm really worried that the benefits are very localized, and that the externalized costs are vast, and the damage and potential damage isn't being addressed. I think that they could be one of the greatest ever drivers of inequality as a privileged few profit at the expense of the many.
Any debate seems to neglect this as it veers off into AGI Skynet fantasy-land damage rather than grounded real-world damage. This seems like a deliberate distraction.
Instead, my brain parses code into something like an AST which then is represented as a spatial graph. I model the program as a logical structure instead of a textual one. When you look past the language, you can work on the program. The two are utterly disjoint.
I think LLMs fail at software because they're focused on text and can't build a mental model of the program logic. It takes a huge amount of effort and brainpower to truly architect something and understand large swathes of the system. LLMs just don't have that type of abstract reasoning.
It's funny that everyone says that "LLMs" have plateaued, yet the base models have caught up with early attempts to build harnesses with the things I've mentioned above. They now match or exceed the previous generation software glue, with just "tools", even with limited ones like just "terminal".
Now go, researchers!
- Whenever you address a failing test, always bring your component mental model into the context.
Paste that into your Claude prompt and see if you get better results. You'll even be able to read and correct the LLM's mental model.
Junior developers not even out of school don’t need to be instructed to think.
I'd tend to think it more proper to return 401 when you didn't authenticate and 403 when you're forbidden from doing that with those user rights, but you have to be careful about exactly how detailed your messages are, lest they get tagged as a CWE-209 in your next security audit.
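To illustrate the split, here is a minimal sketch (hypothetical authorize() helper and user object, not code from any real system), where the key point is that the responses stay terse so they don't leak details and trip CWE-209:

    from http import HTTPStatus

    def authorize(user, required_role: str) -> HTTPStatus:
        # No credentials presented at all: ask the client to authenticate.
        if user is None:
            return HTTPStatus.UNAUTHORIZED  # 401
        # Authenticated, but lacking the privilege: refuse, without explaining which role was missing.
        if required_role not in user.roles:
            return HTTPStatus.FORBIDDEN  # 403
        return HTTPStatus.OK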
so current LLMs might not quite be human level, but I'd have to see a bigger model fail before I'd conclude that it can't do $X.
Put another way: you have an Excel roster corresponding to people with accounts, where some need to have their accounts shut down, but you only have their first and last names as identifiers, and the pool is sufficiently large that there is more than one person for a given pair of names.
You can't shut down all accounts with a given name, and there is no unique identifier. How do you solve this?
You have to ask and be given that unique identifier that differentiates between the undecidable. Without that, even the person can't do the task.
The person can make guesses, but those guesses are just hallucinations with a significant probability of a bad outcome.
At a core level I don't think these types of issues are going to be solved. Quite a lot of people would be unable to solve this and would struggle with this example (when not given the answer, or when the solution isn't hinted at in the framing of the task; i.e. when they just have a list of names and are told to do an impossible task).
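A tiny sketch of that ambiguity (hypothetical data), just to show that no cleverness resolves it without the extra identifier:

    from collections import Counter

    # The roster only has names; the account store has the real identifiers.
    accounts = [("Ana", "Silva", "acct-001"),
                ("Ana", "Silva", "acct-007"),
                ("Bo", "Chen", "acct-003")]
    to_shut_down = [("Ana", "Silva"), ("Bo", "Chen")]

    counts = Counter((first, last) for first, last, _ in accounts)
    for name in to_shut_down:
        matches = [acct for first, last, acct in accounts if (first, last) == name]
        if counts[name] > 1:
            # Undecidable without a unique identifier; any guess here is a hallucination.
            print(name, "is ambiguous:", matches)
        else:
            print("shut down", matches[0])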
Kind of hyperbolic. If you prompt well, it generally won't do anything that extremely stupid.
”Wise men speak because they have something to say; Fools speak because they have to say something” -Plato
Grug is both the high and low end of the Bell curve.
> THINK they are big brained developers many, many more, and more even definitely probably maybe not like this, many sour face (such is internet)
> (note: grug once think big brained but learn hard way)
Coincidentally, I encountered the author's work for the first time only a couple of days ago as a podcast guest; he vouches for the "Dirty Code" approach while straw-manning Uncle Bob's general principles of balancing terseness/efficiency with ergonomics and readability (in most, but not all, cases).
I guess this stuff sells t-shirts and mugs /rant
Have you read Uncle Bob? There's no need to strawman: Bob's examples in Clean Code are absolutely nuts.
Here's a nice writeup that includes one of Bob's examples verbatim in case you've forgotten: https://qntm.org/clean
Here's another: https://gerlacdt.github.io/blog/posts/clean_code/
Yes, I have read Uncle Bob. I could agree that the examples in the book leave room for improvement.
Meanwhile, the real-world application of these principles and trial-and-error, collectively within my industry, yields a more accurate picture of its usefulness.
Even the most click-bait'y criticisms (such as the author I referenced above) involve zooming in on its most controversial aspects, in a vacuum, without addressing the core principles and how they're completely necessary for delivering software at scale, warranting its status as a seminal work.
"...for the obedience of fools, and the guidance of wise men", indeed!
edit - it's the same arc as Agile has endured:
1. a good-faith argument for a better way of doing things is recognised and popularised.
2. It's abused and misused by bad actors/incompetents for years (who would not have done better using a different process)
3. Jaded/opportunistic talking heads tell us it's all garbage while simultaneously explaining that "well, it would be great if it wasn't applied poorly..."
It's not "zooming in" to point out that the first and second rules in Bob's work are "functions should be absurdly tiny, 4 lines or less" and that in the real world that results in unreadable garbage. This isn't digging through and looking for edge cases - all of the rules are fundamentally flawed.
Sure, if you summarize the whole book as "keep things small with a single purpose" that's not an awful message, but that's not the book. Other books have put that point better without all of the problems. The book is full of detailed specific instructions, and almost all of the specifics are garbage that causes more bad than good in the real world.
Clean Code has no nuance, only dogma, and that's a big problem (a point the second article I linked calls out and discusses in depth). There are some good practices in it, but basically all of its code is a mistake that is harmful to a new engineer to read.
Assuming that you have read the book, I find it odd that you would consider that to be the steel-man a fan of this work would invent; it covers considerably more ground than that:
- Prioritise human-readability
- Use meaningful names
- Consistent formatting
- Quality comments
- Be DRY, stop copy-pasting
- Test
- SOLID
All aspects of programming, to this day, I routinely see done lazily and poorly. This rarely correlates with experience, and usually with aptitude.
>Clean Code has no nuance, only dogma, and that's a big problem (a point the second article I linked calls out and discusses in depth)
It's opinionated and takes its line of reasoning to the Nth degree. We can all agree that the application of the rules requires nuance and intelligence. The second article you linked is a lot more forgiving and pragmatic than your characterisation of the issue.
I would expect the entire industry to do a better job of picking apart and contextualising the work, after it made an impact on the industry, than the author himself could or ever will be capable of.
My main problem is the inanity of reactionary criticism which doesn't engage with the ideas. Is Clean Code responsible for a net negative effect on our profession, directly or indirectly? Are we correlating a negative trend in ability with the influence of this work? What exactly are "Dirty Code" mug salesmen proposing as an alternative; what are they even proposing as being the problem, other than that the examples in CC are bad and it's easy to misapply its principles?
Except Uncle Bob, it seems, as evidenced by his code samples and his presentations in the years since that book came out. That's my objection. Many others have presented Bob's ideas better in the last 19 years. The book was good at the time, but we're a decade past when we should have stopped recommending it. Have folks go read Ousterhout instead - shorter, better, more durable.
Now, to your points: 1) Regarding adding more words to the context window, it's not about "more"; it's about "enough." If you don't have enough context for your task, how will you accomplish it? "Go there, I don't know where." 2) Regarding "problem solved," if the LLM suggests or does such a thing, it only means that, given the current context, this is how the average developer would solve the issue. So it's not an intelligence issue; it's a context and training-set issue! When you write that "software engineers can step back, think about the whole thing, and determine the root cause of a problem," notice that you're actually referring to context. If you don't have enough context or a tool to add data, no developer (digital or analog) will be able to complete the task.
That seems to me like a perfectly fine description of state space & chain-of-thought continuation.
> LLMs get endlessly confused: they assume the code they wrote actually works; when tests fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over. This is exactly the opposite of what I am looking for. Software engineers test their work as they go. When tests fail, they can check in with their mental model to decide whether to fix the code or the tests, or just to gather more data before making a decision. When they get frustrated, they can reach for help by talking things through. And although sometimes they do delete it all and start over, they do so with a clearer understanding of the problem.
My experience is based on using Cline with Anthropic Sonnet 3.7 doing TDD on Rails, and it has been very different. I instruct the model to write tests before any code and it does. It works in small enough chunks that I can review each one. When tests fail, it tends to reason very well about why and fixes the appropriate place. It is very common for the LLM to consult more code as it goes to learn more.
It's certainly not perfect but it works about as well, if not better, than a human junior engineer. Sometimes it can't solve a bug, but human junior engineers get in the same situation too.
Yup.
> I ... review each one
Yup.
These two practices are core to your success. GenAI reliably hangs itself given a longer rope.
I say capture logs without overriding console methods -> they override console methods.
YOU ARE NOT ALLOWED TO CHANGE THE TESTS -> test changed
Or they insert various sleep calls into a test to work around race conditions.
This is all from Claude Sonnet 4.
Say "keep your hands at your side, it's hot" and not "don't touch the stove, it's hot". If you say the latter, most kids touch the stove.
This reminds me of the query "shirt without stripes" on any online image/product search.
There is a steady decline in models' capabilities across the board as their contexts get longer. Wiping the slate clean regularly really helps to counteract this, but it can become a pain to rebuild the context from scratch over and over. Unfortunately, I don't really know any other way to avoid the models getting really dumb over time.
OTOH I tried building a native Windows Application using Direct2D in Rust and it was a disaster.
I wish people could be a bit more open about what they build.
Here's what works however:
Mostly CRUD apps or REST APIs in Rails, Django, or microframeworks such as FastAPI, etc.
Or with React.
In that too, focus on small components and small steps or else you'll fail to get the results.
That is, so long as you stay inside the guard rails. Ask it to make something in a rails app that's slightly beyond the CRUD scope and it will suffer - much like most humans would.
So it's not that it's bad to let bots do boilerplate. But using very qualified humans for that was a waste to begin with. Hopefully in a few years none of us will need to do ANY part of CRUD work and we can do only the fun parts of software development.
My ChatGPT is amazingly competent at gardening! Well, that’s how it feels anyway. Is it correct? I have no idea. It sounds right. Fortunately, it’s just a new hobby for me and the stakes are low. But generally I think it’s much better to be paranoid than gullible when it comes to confident sounding ramblings, whether it’s from an LLM or a marketing guru.
I would say for the last 6 months, 95% of the code for my chat app (https://github.com/gitsense/chat) was AI generated (98% human architected). I believe what I created in the last 6 months was far from trivial. One of the features that AI helped a lot with, was the AI Search Assistant feature. You can learn more about it here https://github.com/gitsense/chat/blob/main/packages/chat/wid...
As a debugging partner, LLMs are invaluable. I could easily load all the backend search code into context and have it trace a query and create a context bundle with just the affected files. Once I had that, I would use my tool to filter the context to just those files and then chat with the LLM to figure out what went wrong or why the search was slow.
I very much agree with the author of the blog post about why LLMs can't really build software. AI is an industry game changer as it can truly 3x to 4x senior developers in my opinion. I should also note that I spend about $2 a day on LLM API calls (99% to Gemini 2.5 Flash) and I probably have to read 200+ LLM generated messages a day and reply back in great detail about 5 times a day (think of an email instead of chat message).
Note: The demo that I have in the README hasn't been set up, as I am still in the process of finalizing things for release, but the NPM install instructions should work.
I can think of nothing more tiresome than having to read 200 emails a day, or LLM chat messages. And then respond in detail 5 of those times. It wouldn't lead to "3x to 4x" performance gain after tallying up all the time reading messages and replying. I'm not sure people that use LLMs this way are really tracking their time enough to say with any confidence that "3x to 4x" is anywhere close to reality.
I'm going to start producing metrics regarding how much code is AI generated along with some complexity metrics.
I am obviously biased, but this definitely feels like a paradigm shift, and if people do not fully learn to adapt to it, it might be too late. I am not sure if you have ever watched Gattaca, but this sort of feels like it...the astronaut part, that is.
The profession that I have known for decades is starting to feel very different, in the same way that while watching Gattaca, my perception of astronauts changed. It was strange, but plausible and that is what I see for the software industry. Those that can articulate the problem I believe will become more valuable than the silent genius.
Why would it ever be too late?
This is very measurable, as you are not measuring against others, but yourself. The baseline is you, so it is very easy to determine if you become more productive or not. What you are saying is, you do not believe "you" can leverage AI to be more efficient than you currently are, which may well be true due to your domain and expertise.
Business is business, and if you can demonstrate that you are needed they will keep you, for the most part, but business also has politics.
> probably monitoring how much we use the "AI" and that could become a metric for job performance
I will bet on this and take it one step further. They (employer) are going to want to start tracking LLM conversations. If everybody is using AI, they (employer) will need differentiators to justify pay raises, promotions and so forth.
> they (employer) will need differentiators to justify pay raises, promotions and so forth.
That is exactly what I meant.
But you need to get your workflow right.
Claiming that the people making an AI coding tool (Zed) don't know LLM coding tools is both preposterous and extremely arrogant.
> when tests fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over
Is the EXACT OPPOSITE of what LLMs tend to do. They are very stubborn in their approach and will keep at it, often until you roll back to a previous prompt. Them deleting code tends to happen on command, except specifically if I do TDD, which may as well be a preemptive command to do so.
Even if you supply them with the file content, they are not able to recall it, or if they do, they will quickly forget.
For example, if you tell them that the "Invoice" model has fields x, y, z and supply part of the schema.
A few responses later, it will give you an Invoice model that has a, b, c, because those are the most common ones.
Adding to this, you have them writing tautology tests, removing requirements to "fix" the bugs, and hallucinating new requirements, and you end up with catastrophic consequences.
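By "tautology tests" I mean something like this sketch (hypothetical names): it exercises the mock rather than the code under test, so it can never fail and proves nothing.

    from unittest.mock import MagicMock

    def test_invoice_total():
        invoice = MagicMock()
        invoice.total.return_value = 42
        # Asserts the mock against the value we just stuffed into it;
        # says nothing about the real Invoice implementation.
        assert invoice.total() == 42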
It does not work so well for any problem it has not seen before. At that point you need to explain the problem and instruct the solution. So at that point, you're just acting as a mentor instead of using your capacity to implement the solution yourself.
My whole team has really bought into the "claude-code" way of doing side tasks that have been on the backlog for years, think like simple refactors, or secondary analytic systems. Basically any well-trodden path that is mostly constrained by time that none of us are given, are perfect for these agents right now.
Personally I'm enjoying the ability to highlight a section of code and ask the LLM to explain this to me like I'm 5, or look for any potential race conditions. For those archiac, fragile monolithic blocks of code that stick around long after the original engineers have left, it's magical to use the LLM to wrap my head around that.
I haven't found it can write these things any better though, and that is the key here. It's not very good at creating new things that aren't commonly seen. It also has a code style that is quite different from what already exists. So when it does inject code, oftentimes it has to be rewritten to fit the style around it. Already, I'm hearing whispers of people saying things like "code written for the AI to read." That's where my eyes roll, because the payoff for the extra mental bandwidth doesn't seem worth it right now.
I haven't tried it with Rails myself (haven't touched Ruby in years, to be honest), but it doesn't surprise me that it would work well there. Ruby on Rails programming culture is remarkably consistent about how to do things. I would guess that means that the LLM is able to derive a somewhat (for lack of a better word) saner model from its training data.
By contrast, what it does with Python can get pretty messy pretty quickly. One of the biggest problems I've had with it is that it tends to use a random hodgepodge of different Python coding idioms. That makes TDD particularly challenging because you'll get tests that are well designed for code that's engineered to follow one pattern of changes, written against a SUT that follows conventions that lead to a completely different pattern of changes. The result is horribly brittle tests that repeatedly break for spurious reasons.
And then iterating on it gets pretty wild, too. My favorite behavior is when the real defect is "oops I forgot to sort the results of the query" and the suggested solution is "rip out SqlAlchemy and replace it with Django."
R code is even worse; even getting it to produce code that follows a spec in the first place can be a challenge.
it does a good enough job of wrangling behavior via implied context of the test-space that it seems to really reduce the amount of explanation needed and surprise garbage output.
The reality is the author very much understands what's available today. Zed, after all, is building out a lot of AI-focused features in its editor and that includes leveraging SOTA LLMs.
> It's certainly not perfect but it works about as well, if not better, than a human junior engineer. Sometimes it can't solve a bug, but human junior engineers get in the same situation too.
I wonder if comments like this are more of a reflection on how bad the hiring pool was even a few years ago than a reflection of how capable LLMs are. I would be distraught if I hired a junior eng with less wherewithal and capabilities than Sonnet 3.7.
I see this line of reasoning a lot from AI advocates and honestly it's depressing. Do you see less experienced engineers as nothing more than outputters of code? The entire point of being "junior" at something is that you can learn and grow, which these LLM tools cannot.
Uh...
This author is developing "LLMs and coding tools of today." It's not like they're just making a typical CRUD Rails app.
People complained endlessly about the internet in the early to mid 90s: it was slow, static, most sites had under-construction signs on them, your phone modem would just randomly disconnect. The internet did suck in a lot of ways, and yet people kept using it.
Twitter sucked in the mid 2000s, we saw the fail whale weekly and yet people continued to use it for breaking news.
Electric cars sucked, no charging, low distance, expensive and yet no matter how much people complain about them they kept getting better.
Phones sucked, pre 3G was slow, there wasn't much you could use them for before app stores and the cameras were potato quality and yet people kept using them while they improved.
Always look for the technology that sucks and yet people keep using it because it provides value. LLMs aren't great at a lot of tasks, and yet no matter how much people complain about them, they keep getting used and keep improving through constant iteration.
LLMs may not be able to build software today, but they are 10x better than where they were in 2022 when we first started using ChatGPT. It's pretty reasonable to assume that in 5 years they will be able to do these types of development tasks.
A lot of what you described as "sucking" was not seen as "sucking" at the time. Nobody complained about phones being slow because nobody expected to use phones the way we do today. The internet was slow and less stable, but nobody complained that they couldn't stream 4K movies, because nobody expected to. This is anachronistic.
The fact that we can see how some things improved in X or Y manner does not mean that LLMs will improve the way you think they will. Maybe we invent a different technology that does a better job. After all, it was not that dial-up itself became faster, and I don't think there were fanatics saying that dial-up technology would give us 1Gbps speeds. The problem with AI is that because scaling up compute has provided breakthroughs, some think that somehow with more compute and some technical tricks we can solve all the current problems. I don't think anybody can say that we cannot invent a technology that overcomes these, but whether LLMs are that technology that can just keep scaling is very much in doubt. In the last year or so there has been a lot of refinement and broadening of applications, but nothing like a breakthrough.
Has VR really improved 10x? I lost touch after the HTC Vive and heard about Valve Index but I was under the impression that even the best that Apple has on offer is 2x at most.
This is a big rewrite of history. Phones took off because before mobile phones the only way to reach a person was to call when they were at home or their office. People were unreachable for timespans that now seem quaint. Texting brought this into async. The "potato" cameras were the advent of people always having a camera with them.
People using the Nokia 3210 were very much not anticipating when their phones would get good, they were already a killer app. That they improved was icing on the cake.
It always bugs me whenever I hear someone defend some new tech (blockchain, LLMs, NFTs) by comparing it with phones or the internet or whatever. People did not need to be convinced to use cell phones or the internet. While there were absolutely some naysayers, the utility and usefulness of these technologies was very obvious by the time they became available to consumers.
But also, there's survivorship bias at play here. There are countless promising technologies that never saw widespread adoption. And any given new technology is far more likely to end up as a failure then it is to become "the next iPhone" or "the new internet."
In short, you should sell your technology based on what it can do right now, instead of what it might do in the future. If your tech doesn't provide utility right now, then it should be developed for longer before you start charging money for it. And while there's certainly some use for LLMs, a lot of the current use cases being pushed (google "AI overviews", shitty AI art, AIs writing out emails) aren't particularly useful.
For example, it would be wrong for me to say that "hyperloop got a ton of hype and investments, and it failed. Therefore LLMs, which are also getting a ton of hype and investments, will also fail." Hyperloop and LLMs are fundamentally different technologies, and the failure of hyperloop is a poor indicator of whether LLMs will ultimately succeed.
Which isn't to say we can't make comparisons to previous successes or failures. But those comparisons shouldn't be your main argument for the viability of a new technology.
It may have helped that shopping carts were actively designed to be pushed.
My main argument for the viability of the technology is that it's useful today. Even if it doesn't improve from here, my job as a coder has already been changed.
This is so annoying to me. My job as a coder hasn't changed because my responsibilities as a coder haven't changed.
Whether or not I beg an LLM to write code for me or write it myself the job is the same. At best there's a new tool to use but the job hasn't changed.
Carts were a necessity to get people to interact with the new "center aisles" of the grocery store which is mostly full of boxed and canned garbage.
In the early and mid 1990s, people effectively did not use the internet. Usage was tiny and minuscule, limited to only small niche groups. People heard about the internet via the 90-second blurb on the evening news show. It wasn't until sometime after the launch of Facebook that the internet was even mainstream. So I really don't think people complained about the internet being slow when they weren't using it.
I can go on here, but I don't really need to spend paragraphs refuting something that is obviously false.
Classic LLM behavior
Ha, generally when someone can't disprove something, it's because they don't have a valid point. You not being able to disprove my point is very telling :)
Me, I agree with the author of the article. It's possible that the technology will eventually get there, but it doesn't seem to be there now. And I prefer to make decisions based on present-day reality instead of just assuming that the future I want is the future I'll get.
Ha;) Yes, when you provide examples to prove your point they are, by definition, selective:)
You are free to develop your own mental models of what technology and companies to invest in. I was only trying to share my 20 years of experience with investing to show why you shouldn't discard current technology because of its current limits.
Engineering decisions, which is closer to what TFA is talking about, tend to have to be a lot more focused on the here & now. You can make bets on future R&D developments (e.g, the Apollo program), but that's a game best played when you also have some control over R&D budgeting and direction (e.g, the Apollo program), and when you don't have much other choice (e.g, the Apollo program).
Specifically, to me the limitation of LLMs is discovering new knowledge and being able to reason about information they haven't seen before. LLMs still fail at things like counting the number of b's in the word blueberry or not getting distracted by inserting random cat facts in word problems (both issues I've seen appear in the last month)
I don't mean that to say they're a useless tool, I'm just not into the breathless hype.
The latest releases are seeing smaller and smaller improvements, if any. Unless someone can explain the technical reasons why they're likely to scale to being able to do X then it's a pretty useless claim
We can expect them to be better in 5 years, but your last assertion doesn't follow. We can't assert with any certainty that they will be able to specifically solve the problems laid out in the article. It might just not be a thing LLMs are good at, and we'll need new breakthroughs that may or may not appear.
FWIW - 3d printing has come a far way, and I personally have a 3D printer. But the idea that it was going to completely disrupt manufacturing is simply not true. There are known limitations (how the heck are you going to get a wood polymer squeezed through a metal tip?) and those limitations are physics, not technical ones.
They haven't continued to see massive adoption and improvement despite the flaws people point out.
They had initial success at printing basic plastic pieces but have failed to print in other materials like metal as you correctly point out, so these wouldn't pass my screening as they currently sit.
It took them over a century to get to their current point.
And NFTs had a lot of loud detractors.
And everyone complained about a million other solutions that did not go anywhere.
Still, a bunch of investors made a lot of money on VR and very much so on NFT. Investments being good is not an indicator of anything being useful.
And NFTs were always perceived as a scam, same as the breathless blockchain nonsense.
LLMs have many many issues, but I think they stick out as different to the other examples.
So consider your analogy: the internet was always useful, but it was JavaScript that caused the actual titanic shift in the software industry, even though the core internet backbone didn't radically improve as fast as you imagine it would have. JavaScript was hacked together as a toy scripting language meant to make pages more interactive, but it turned out to be the key piece in unlocking that 10x value of the already-existing internet.
Agents and the explosion of all these little context services are where I see the same thing happening here. Right now they are buggy, and mostly experimental toys. However, they are unlocking that 10x value.
Was it? I remember a lot more installable software than you do being the core usage of computers. Even today, most people are using apps.
most of the vibe shift I think I’ve seen in the past few months to using LLMs in the context of coding has been improvements in dataset curation and ux, not fundamentally better tech
That doesn't seem unexpected. Any technological leap seems to happen in sigmoid-like steps. When a fruitful approach is discovered, we run with it until diminishing returns set in. Often enough, a new approach opens doors to other approaches that build on it. It takes time to discover the next step in the chain, but when we do, we get a new sigmoid-like leap. Etc...
I.e. combining new approaches around old school "AI" with GenAI. That's probably not exactly what he's trying to do but maybe somewhere in the ball park.
1 - https://x.com/victortaelin
Go open the OpenAI API playground and give GPT3 and GPT5 the same prompt to make a reasonably basic game in JavaScript to your specification and watch GPT 3 struggle and GPT 5 one-shot it.
Thing is breakthroughs are always X years away (50 for fusion power for example).
The only example he gave that actually was kind of a big deal was mobile phones, where capacitive touchscreens really did catapult the technology forward. But it is not like cellphones weren't already super useful, profitable, and getting better over time before capacitive touchscreens were introduced.
Maybe broadband to the internet also qualifies.
I think a lot of them relied on gradual improvement and lots of 'mini-breakthroughs' rather than one single breakthrough that changes everything. These mini-breakthroughs took decades to realise themselves properly in almost every example on the list too, not just a couple of years.
My personal gut feel is that even if the core technology plateaus, there's still lots of iterative improvement to go after in the productisation/commercialisation of the existing technology (e.g. improving tooling/UI, applying it to solving real problems, productising current research, etc).
In electric car terms - we are still at the stage where Tesla is shoving batteries in a lotus elise, rather than releasing the model 3. We might have the lithium polymer batteries, but there's still lots of work to do to pull it into the final product.
(Having said this - I don't think the technology has plateaued - I think we are just looking at it across a very narrow time span. If in 1979 you said that computers had plateaued because there hadn't been much progress in the last 12 months, you would have been very wrong - breakthroughs sometimes take longer as a technology matures, but that doesn't mean that the technology two decades from now won't be substantially different.)
Yes, the newest models are so much better that they obsolete the old ones, but now the biggest differences between models is primarily what they know (parameter count and dataset quality) and how much they spend thinking (compute budget).
Uhhh, no?
In the past month we've had:
- LLMs (3 different models) getting gold at IMO
- gold at IoI
- beat 9/10 human developers at AtCoder heuristics (optimisation problems), with the single human who actually beat the machine saying he was exhausted and that next year it'll probably be over.
- agentic coding that actually works, and works for 30-90 minute sessions while staying coherent and actually finishing tasks.
- a 4-6x reduction in price for top-tier (SotA?) models. OpenAI's "best" model now costs $10/MTok, while retaining 90+% of the capability of their previous SotA models that were $40-60/MTok.
- several "harnesses" being released by every model provider. Claude code seems to remain the best, but alternatives are popping off everywhere - geminicli, opencoder, qwencli (forked, but still), etc.
- open-source models that are getting close to SotA, again. Being 6-12 months behind (depending on who you ask), open source and cheap to run (~$2/MTok on some providers).
I don't see the plateauing in capabilities. LLMs are plateauing only in benchmarks, where the number can only go up so far before it becomes useless. IMO regular benchmarks have become useless. MMLU & co are cute, but agentic whatever is what matters. And those capabilities have only improved. And they will continue to improve, with better data, better signals, better training recipes.
Why do you think every model provider is heavily subsidising coding right now? They all want that sweet, sweet data & signals, so they can improve their models.
A (bad) analogy would be that I can pretty easily tell the difference between a cat and an ape, and the differences in capability are blatantly obvious - but the improvement when going from IQ 70 to Einstein are much harder to assess and arguably not that useful for most tasks.
I tend to find that when I switch to a new model, it doesn't seem any better, but then at some point after using it for a few weeks I'll try to use the older model again and be quite surprised at how much worse it is.
All these things are not black boxes and they are mostly deterministic. Based on the inputs, you know EXACTLY what to expect as output.
That's not the case with LLMs, how they are trained and how they work internally.
We certainly get a better understanding on how to adjust the inputs so we get a desired output. But that's far from guaranteed at the same level as the examples you mentioned.
That's a fundamental problem with LLMs. And you can see that in how industry actors are building solutions around that problem. Reasoning (chain-of-thought) is basically a band-aid to narrow a decision tree, because the LLM does not really "reason" about anything. And the results only get better with more training data. We literally have to brute-force useful results by throwing more compute and memory at the problem (and destroying the environment and climate by doing so).
The stagnation of recent model releases does not look good for this technology.
If we put human engineering teams in the same situation, we’d expect them to do a terrible job, so why do we expect LLMs to do any better?
We can dramatically improve the output of LLM software development by using all those processes and tools that help engineering teams avoid these problems:
https://jim.dabell.name/articles/2025/08/08/autonomous-softw...
Granted, it needs careful planning for CLAUDE.md, and all issues and feature requests need a lot of in-depth specifics, but it all works. So I am not 100% convinced by this piece. I'd say it's definitely not easy to get coding agents to manage and write software effectively, and it's especially hard to do so in existing projects, but my experience has been across that entire spectrum. I have been sorely disappointed in coding agents and even abandoned a bunch of projects and dozens of pull requests, but I have also seen them work.
you can check out that project here: https://github.com/julep-ai/steadytext/
If the LLM started sketching up screens and asked questions back about the intention of the software, then I am sure people would have a much better experience.
Plus, the most creative solutions often come from implicit and explicit constraints. This is entirely a human skill and something we excel at.
These LLMs aren't going to "consider" something, understand the constraints, and then fit a solution inside those constraints that weren't explicitly defined for it somehow. If constraints aren't well understood, either through common problems, or through context documents, it will just go off the deep end trying to hack something together.
So right now we still need to rely on humans to do the work of breaking problems down, scoping the work inside of those constraints, and then coming up with a viable path forward. Then, at that point, the LLM becomes just another way to execute on that path forward. Do I use javascript, rust, or Swift to write the solution, or do I use `CLAUDE.md` with these 30 MCP services to write the solution.
For now, it's just another tool in the toolbox at getting to the final solution. I think the conversations around it needing to be a binary either, all or nothing, is silly.
If it isn't easy to give commands to LLMs, then what is the purpose of them?
Because LLMs were trained for one shot performance and they happen to beat humans at that.
(Also, there is no conflict of interest here, and you do not need to yell. I’m free to criticize my company if I like.)
If you do the thinking and let the LLM do the typing it works incredibly well. I can write code 10x faster with AI, but I’m maintaining the mental model in my head, the “theory” as Naur calls it. But if you try to outsource the theory to the LLM (build me an app that does X) you’re bound to fail in horrible ways. That’s why Claude Code is amazing but Replit can only do basic toy apps.
The more I use claude code, the more frustrated I get with this aspect. I'm not sure that a generic text-based LLM can properly solve this.
My gut feeling is that this problem won't be solved until some new architecture is invented, on the scale of the transformer, which allows for short-term context, long-term context, and self-modulation of model weights (to mimic "learning"). (Disclaimer: hobbyist with no formal training in machine learning.)
[0]: https://news.ycombinator.com/item?id=44798166
LLM techniques allow us to extract rules from text and other data. But those data are not representative of a coherent system. The result itself is incoherent and lacks anything that wasn't part of the data. And that's normal.
It’s the same as having a mathematical function. Every point that it maps to is meaningful, everything else may as well not exists.
That and other tricks have only made me slightly less frustrated, though.
You can let it do the grunt coding, and a lot of the low level analysis and testing, but you absolutely need to be the one in charge on the design.
It frankly gives me more time to think about the bigger picture within the amount of time I have to work on a task, and I like that side of things.
There's definitely room for a massive amount of improvement in how the tool presents changes and suggestions to the user. It needs to be far more interactive.
My experience with prompting LLMs for codegen is really not much different from my experience with querying search engines - you have to understand how to ‘speak the language’ of the corpus being searched, in order to find the results you’re looking for.
I keep saying it and no one really listens: AI really is advanced autocomplete. It's not reasoning or thinking. You will use the tool better if you understand what it can't do. It can write individual functions pretty well, stringing a bunch of them together? not so much.
It's a good tool when you use it within its limitations.
> LLMs get endlessly confused: they assume the code they wrote actually works; when tests fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over.
I see this constantly with mediocre developers. Flailing, trying different things, copy-pasting from StackOverflow without understanding, ultimately deciding the compiler must have a bug, or cosmic rays are flipping bits.
I feel this way because at my company our interns on a gap year from their comp sci degree don't blame the compiler or cosmic rays, or blindly copy from Stack Overflow.
They're incentivized and encouraged to learn and absolutely choose to do so. The same goes for seniors.
If you say 'I've been learning about X for ticket Y' in the standup people basically applaud it, managers like us training ourselves to be better.
Sure managers may want to see a brief summary or a write-up applicable to our department if you aren't putting code down for a few days, but that's the only friction.
However the other day I gave ChatGPT a relatively simple assignment, and it kept ignoring the rules. Every time I corrected it, it broke a different rule. I was asking it for gender-neutral names, but it kept giving last names like Orlov (which becomes Orlova), or first names that are purely masculine.
Is it the same with vibe coding?
Tried using it for the first time for vibe coding and was quite disappointed with the overall result, felt like a college student hastily copy pasting code from different sources for a project due tomorrow.
Maybe I just gave bad prompts…
I find it to be the most challenging part. There's a large amount of unstated assumptions that you take for granted, and if you don't provide them all upfront, you'll need to regenerate the code, again and again. I now invest a lot of time into writing all this down before I generate any code.
> But what they cannot do is maintain clear mental models.
The emphasis should be on maintain. At some point, the AI tends to develop a mental model, but over time, it changes in unexpected ways or becomes absent altogether. In addition, the quality of the mental models is often not that good to begin with.
What's missing is a part with more plasticity that can work in parallel and bi-directionally interact with the current static models in real-time.
This would mean individually trained models based on their experience so that knowledge is not translated to context, but to weight adjustments.
Disclaimer: These are my not-terribly-informed layperson's thoughts :^)
The attention mechanism does seem to give us a certain adaptability (especially in the context of research showing chain-of-thought "hidden reasoning") but I'm not sure that it's enough.
Thing is, earlier language models used recurrent units that would be able to store intermediate data, which would give more of a foothold for these kinds of on-the-fly adjustments. And here is where the theory hits the brick wall of engineering. Transformers are not just a pure machine learning innovation; the key is that they are massively scalable, and my understanding is that part of this comes from the _lack_ of recurrence.
I guess this is where the interest in foundation models comes from: if you could take a codebase as a whole and turn it into effective training data, you could adjust the weights of an existing, more broadly trained model. But is this possible with a single codebase's worth of data?
Here again we see the power of human intelligence at work: the ability to quite consciously develop new mental models even given very little data. I imagine this is made possible by leaning on very general internal world-models that let us predict the outcomes of even quite complex unseen ("out-of-distribution") situations, and that gives us extra data. It's what we experience as the frustrations and difficulties of the learning process.
What's happened for me recently is I've started to revisit the idea that typing speed doesn't matter.
This is an age-old thing; most people don't think it really matters how fast you can type. I suppose the steelman is that most people think it doesn't really matter how fast you can get the edits to your code that you want. With modern tools, you're not typing out all the code anyway, and there are all sorts of non-AI ways to get your code looking the way you want. And that doesn't matter: the real work of the engineer is the architecture of how the whole program functions. Typing things faster doesn't make you get to the goal faster, since finding the overall design is the limiting thing.
But I've been using Claude for a while now, and I'm starting to see the real benefit: you no longer need to concentrate to rework the code.
It used to be burdensome to do certain things. For instance, I decided to add an enum value, and now I have to address all the places where it matches on that enum. This wasn't intellectually hard in the old world, you just got the compiler to tell you where the problems were, and you added a little section for your new value to do whatever it needed, in all the places it appeared.
But you had to do this carefully, otherwise you would just cause more compile/error cycles. Little things like forgetting a semicolon will eat a cycle, and old tools would just tell you the error was there, not fix it for you.
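To make the chore concrete, here's a minimal sketch (hypothetical PaymentState enum; I'm using Python 3.11+ with mypy's assert_never as a stand-in for the compile-time check described above, since the idea is the same):

    from enum import Enum, auto
    from typing import assert_never

    class PaymentState(Enum):
        PENDING = auto()
        SETTLED = auto()
        # Add REFUNDED here and the type checker flags every match below
        # that doesn't yet handle it.

    def describe(state: PaymentState) -> str:
        match state:
            case PaymentState.PENDING:
                return "awaiting settlement"
            case PaymentState.SETTLED:
                return "done"
            case _:
                # mypy reports an error here for any unhandled member.
                assert_never(state)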
LLMs fix it for you. Now you can just tell Claude to change all the code in a loop until it compiles. You can have multiple agents working on your code, fixing little things in many places, while you sit on HN and muse about it. Or perhaps spend the time considering what direction the code needs to go.
The big thing however is that when you're no longer held up by little compile errors, you can do more things. I had a whole laundry list of things I wanted to change about my codebase, and Claude did them all. Nothing on the business level of "what does this system do" but plenty of little tasks that previously would take a junior guy all day to do. With the ability to change large amounts of code quickly, I'm able to develop the architecture a lot faster.
It's also a motivation thing: I feel bogged down when I'm just fixing compile errors, so I prioritize what to spend my time on if I am doing traditional programming. Now I can just do the whole laundry list, because I'm not the guy doing it.
interesting point and that matches my experience quite well. LLMs have been horrendous at creating a good design. Even on a micro scale I almost always have them refactor the functions they write
I certainly get a productivity boost at actually doing the implementation.. but the implementation is already there in my head or on paper. It's really hard to know the true improvement
I do find them useful for brainstorming. I can throw a bunch of code and tests at it and ask what edge cases I might want to consider, or anything I've missed. 9/10 of their suggestions I just skip over but often there's a few I integrate
Getting something that works vs creating something that'll do well in the medium-long term is just such a different thing that I'm not sure if they'll be able to improve at the second
I always have a whole bunch of things I want to change in the codebase I'm working on, and the bottleneck is review, not me changing that code.
LLM also helps you test.
Almost every quality software has is designed in from a higher abstraction level. Almost nothing is put there by fixing error after error.
But that's also where said junior learns something. If those juniors get replaced by machines and not even get hired any more, who is going to teach them?
> The real source of our theories is conjecture, and the real source of our knowledge is conjecture alternating with criticism.
(This is rephrased Karl Popper, and Popper cites an intellectual lineage beginning somewhere around Parmenides.)
I see writing tests as a criticism of the code you wrote, which itself was a conjecture. Both are attempting to approach an explanation in your mind, some platonic idea that you think you are putting on paper. The code is an attempt to do so, the test is criticism from a different direction that you have done so.
I've found one thing that helps is using the "Red-Green-Refactor" language. We're in the RED phase - the test should fail. We're in the GREEN phase - make this test pass with minimal code. We're in the REFACTOR phase - improve the code without breaking tests.
This helps the LLM understand the TDD mental model rather than just seeing "broken code" that needs fixing.
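A minimal pytest-style sketch of what that phase language maps to (hypothetical slugify function, just to show the shape of each phase):

    # GREEN phase: the least code that makes the RED-phase test below pass.
    def slugify(title: str) -> str:
        return title.strip().lower().replace(" ", "-")

    # RED phase: this test is written first and must fail before slugify exists.
    def test_slugify_lowercases_and_dashes():
        assert slugify("Hello World") == "hello-world"

    # REFACTOR phase: tidy slugify() (naming, edge cases) without breaking the test.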
I don't want a chat window.
I want AI workflows as part of my IDE, like Visual Studio, InteliJ, Android Studio are finally going after.
I want voice-controlled actions in my native language.
Knowledge across everything on the project for doing code refactorings, static analysis with AI feedback loop, generating UI based out of handwritten sketches, programming on the go using handwriting, source control commit messages out of code changes,...
> AI is awesome for coding! [Opus 4]
> No AI sucks for coding and it messed everything up! [4o]
Would really clear the air. People seem to be evaluating the dumbest models (apparently because they don't know any better?) and then deciding the whole AI thing just doesn't work.
It happens on many topics related to software engineering.
The web developer is replying to the embedded developer who is replying to the architect-that-doesn't-code who is replying to someone with 2 years of experience who is replying to someone working at Google who is replying to someone working at a midsize B2B German company with 4 customers. And on and on.
Context is always omitted and we're all talking about different things ignoring the day to day reality of our interlocutors.
> AI is awesome for coding! [Gpt-5 Pro]
> AI is somewhat awesome for coding! ["gpt-5" with verbosity "high" and effort "high"]
> AI is a pretty good at coding! [ChatGPT 5 Thinking through a Pro subscription with Juice of 128]
> AI is mediocre at coding! [ChatGPT 5 Thinking through a Plus subscription with a Juice of 64]
> AI sucks at coding! [ChatGPT 5 auto routing]
They need to mention significantly more than that: https://dmitriid.com/everything-around-llms-is-still-magical...
--- start quote ---
Do we know which projects people work on? No
Do we know which codebases (greenfield, mature, proprietary etc.) people work on? No
Do we know the level of expertise the people have? No.
Is the expertise in the same domain, codebase, language that they apply LLMs to? We don't know.
How much additional work did they have reviewing, fixing, deploying, finishing etc.? We don't know.
--- end quote ---
And that's just the tip of the iceberg. And that is one iceberg before we hit another: that we're trying to blindly reverse engineer a non-deterministic black box inside a provider's black box.
I feel personally described by this statement. At least on a bad day, or if I'm phoning it in. Not sure if that says anything about AI - maybe just that the whole "mental models" part is quite hard.
I recently tried to get AI to refactor some tests, which it proceeded to break. Then it iterated a bit till it had gotten the pass rate back up to 75%. At this point it declared victory. So yes, it does really seem like a human who really doesn't want to be there.
In the past week, I watched this video[1] from Welch Labs about how deep networks work, and it inspired an idea. I spent some time "vibe coding" with Visual Studio Code's ChatGPT5 preview and had it generate a python framework that can take an image, and teach a small network how to generate that one sample image.
The network was simple... 2 inputs (x,y), 3 outputs (r,g,b), and a number of hidden layers with a specified number of nodes per layer.
It's an agent, it writes code, tests it, fixes problems, and it pretty much just works. As I explored the space of image generation, I had it add options over time, and it all just worked. Unlike previous efforts, I didn't have to copy/paste error messages in and try to figure out how things broke. I was pleasantly surprised that the code just worked in a manner close to what I wanted.
The only real problem I had was getting .venv working right, and that's more of an install issue rather than the LLM's fault.
I've got to say, I'm quite impressed with Python's argparse library.
It's amazing how much detail you can get out of 4 hidden layers of 64 values and 3 output channels (RGB), if you're willing to throw a few days of CPU time at it. My goal is to see just how small of a network I can make to generate my favorite photo.
As it iterates through checkpoints, I have it output an image with the current values, to compare against the original, it's quite fascinating to watch as it folds the latent space to capture major features of the photo, then folds some more to catch smaller details, over and over, as the signal to noise ratio very slowly increases over the hours.
As for ChatGPT5, maybe I just haven't run out of context window yet, but for now, it all just seems like magic.
[1] https://www.youtube.com/watch?v=qx7hirqgfuU
Cursor is a joke tho, windsurf is pretty okay.
Right now the scene is very polarized. You have the "AI is a failure, you can't build anything serious, this bubble is going to pop any day now" camp, and the "AI has revolutionized my workflow, I am now 10x more productive" camp.
I mean these types of posts blow up here every. single. day.
However, I agree with the main thesis (that they can't do it on their own). Related to this, the whole idea of "we will easily fix memory next" will turn out the same way "we can fix vision in one summer" did: 30 years later, much improved, but still not fixed. Memory is hard.
I am a relative newbie to GPU development, and was writing a simple 2D renderer with WebGPU and its Rust implementation, wgpu. The goal is to draw a few textures to a buffer, and then draw that buffer to the screen with a CRT effect applied.
I got 99% of the way there on my own, reading the guide, but then got stumped on a runtime error message. Something like "Texture was destroyed while its semaphore wasn't released". Looking around my code, I see no textures ever being released. I decide to give the LLM a go, ask it to help me, and it very enthusiastically gives a few things to try.
I try them, nothing works. It corrects itself with more things to try, more modifications to my code. Each time giving a plausible explanation as to what went wrong. Each time extra confident that it got the issue pinned down this time. After maybe two very frustrating hours, I tell it to go fuck itself, close the tab and switch my brain on again.
10 minutes later, I notice my buffer's format doesn't match the one used in the render pass that draws to it. Correct that, compile, and it works.
I genuinely don't understand what those pro-LLM-coding guys are doing that they find AIs helpful. I can manage the easy parts of my job on my own, and it fails miserably on the hard parts. Are those people only writing boilerplate all day long?
interesting time, interesting issue.
I wonder is this not just a proxy for intelligence?
That, and their software doesn't actually have any users, I find.
It's understandably frustrating that the promised future ended up being humans having to work how machines want.
That said, I agree with the conclusion. They do seem to be missing coherent models of what they work on - perhaps part of the reason they do so poorly on benchmarks like ARC, which are designed to elicit that kind of skill?
Vibing I often let it explain the implemented business logic (instead of reading the code directly) and judge that.
Maybe I need to do more homework on LLMs in general.
So do Microsoft and GitHub. At least that's what they were telling us the whole time. Oh wait... they changed their mind, I think, a week ago.
That's actually an interesting point, and something I've noticed a lot myself. I find LLMs are very good at hacking around test failures, but unless the test is failing for a trivial reason often it's pointing at some more fundamental issue with the underlying logic of the application which LLMs don't seem to be able to pick up on, likely because they don't have a comprehensive mental model of how the system should work.
I don't want to point fingers, but I've been seeing this quite a bit in the code of colleagues who heavily use LLMs. On the surface the code looks fine, and they've produced tests which pass, but when you think about it for more than a minute you realise it doesn't really capture the nuance of the requirements, in a way a human who had a mental model of how the system works probably wouldn't have done...
Sometimes humans miss things in the logic when they're writing code, but those look more like mistakes in a single line rather than a fundamental failure to comprehend and model the problem. And I know it isn't a failure of comprehension on my colleagues' part, because when you talk to these developers they get the problem perfectly well.
To know whether it's the code or the test that needs fixing, you need a very clear idea of what should be happening, and LLMs just don't have one. I don't know why that is. Maybe it's that they're missing the context from the hours of reading tickets and technical discussions, or maybe it's their failure to ask questions when they're unsure of what should be happening. I don't know if this is a fundamental limitation of LLMs (I'd suspect not, personally), but it is a problem when using LLMs to code today, and one that more compute alone probably can't fix.
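As a hypothetical Python illustration of that gap (the function, numbers, and tests are all invented): the failing test below is pointing at a real logic bug, and "fixing" the test without a model of the requirement just hides it.

```python
# Hypothetical example: a discount function with a real logic bug,
# and two ways of making its failing test pass.

def discounted_price(price, customer_years):
    # Intended rule: 5% off per full year of loyalty, capped at 25%.
    # Bug: the cap is never applied.
    return price * (1 - 0.05 * customer_years)

def test_discount_is_capped():
    # 10 loyalty years should still only earn the 25% cap -> 75.
    # This test fails against the buggy function above (it returns 50).
    assert discounted_price(100, 10) == 75

# "Make the red go away" fix: change the expectation to whatever the code
# currently returns. The test passes; the pricing bug ships.
def test_discount_is_capped_weakened():
    assert discounted_price(100, 10) == 50

# Root-cause fix: change the logic so it matches the actual requirement.
def discounted_price_fixed(price, customer_years):
    return price * (1 - min(0.05 * customer_years, 0.25))
```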
> Recency bias: They suffer a strong recency bias in the context window.
> Hallucination: They commonly hallucinate details that should not be there.
To be fair, those are all issues that most human engineers I've worked with (including myself!) have struggled with to various degrees, even if we don't refer to them the same way. I don't know about the rest of you, but I've certainly had times where I found out that an important nuance of a design was overlooked until well into the process of developing something, forgot a crucial detail that I learned months ago that would have helped me debug something much faster than if I had remembered it from the start, or accidentally made an assumption about how something worked (or misremembered it) and ended up with buggy code as a result. I've mostly gotten pretty positive feedback about my work over the course of my career, so if I "can't build software", I have to worry about the companies that have been employing me and my coworkers who have praised my work output over the years. Then again, I think "humans can't build software reliably" is probably a mostly correct statement, so maybe the lesson here is that software is hard in general.
That's what I've found as well. Start describing or writing a function, include the whole file for context and it'll do its job. Give it a whole codebase and it will just wander in the woods burning tokens for ten minutes trying to solve dependencies.
When you already know exactly what needs to be built and simply want to skip the drudgery of boilerplate or repetitive tasks, a coding CLI is great: it handles the grunt work so you can stay focused on the high-level design and decision-making that truly matter (and also more fun).
Taken a step further, recognizing this makes the investment in such a moonshot pipe dream (overcoming these inherent problems in a deterministic way) look recklessly negligent.
Improvements in model performance seem to be approaching the peak rather than demonstrating exponential gains. Is the quote above where we land in the end?
I find Sonnet frequently loses the plot, but Opus can usually handle it (with sufficient clarity in prompting).
The first project is a C++ embedded device. The second is a sophisticated Django-based UI front end for a hardware device (so, Python interacting with hardware and various JS libraries handling most of the front end).
So far I am deeper into the Django project than the C++ embedded project.
It's interesting.
I had already hand-coded a conceptual version of the UI just to play with UI and interaction ideas. I handed this to Cursor as well as a very detailed specification for the entire project, including directory structure, libraries, where to use what and why, etc. In other words, exactly what I would provide a contractor or company if I were to outsource this project. I also told it to take a first stab at the front end based on the hand-coded version I plopped into a temporary project directory.
And then I channeled Jean-Luc Picard and said "Engage!".
The first iteration took a few minutes. It was surprisingly functional and complete. Yet, of course, it had problems. For example, it failed to separate various screens into separate independent Django apps. It failed to split its one big beautiful CSS file and JS file into independent app-specific CSS and JS files. In general, it ignored separation of concerns and just made it all work. This is the kind of thing you might expect from a junior programmer/fresh grad.
Achieving separation of concerns and undoing other undesirable cross-pollination of code took some effort. LLMs don't really understand. They simulate understanding very well, but, at the end of the day, I don't think we are there. They tend to get stuck and make dumb mistakes.
The process to get to something that is now close to a release candidate entailed an interesting combination of manual editing and "molding" of the code base with short, precise and scope-limited instructions for Cursor. For my workflow I am finding that limiting what I ask AI to do delivers better results. Go too wide and it can be in a range between unpredictable and frustrating.
Speaking of frustrations, one of the most mind-numbing things it does every so often is also in a range, between completely destroying prior work and selectively eliminating or modifying functionality that used to work. This is why limiting the scope, for me, has been a much better path. If I tell it to do something in app A, there's a reasonable probability that it isn't going to mess with and damage the work done in app B.
This issue means that testing becomes far more important in this workflow, because, on every iteration, you have no idea what functionality may have been altered or damaged. It will also go nuts and do things you never asked it to do. For example, I was in the process of redoing the UI for one of the apps. For some reason it decided it was a good idea to change the UI for one of the other apps, remove all controls and replace them with controls it thought were appropriate or relevant (which wasn't even remotely the case). And, no, I did not ask it to touch anything other than the app we were working on.
Note: For those not familiar with Django, think of an app as a page with mostly self-contained functionality. Apps (pages) can share data with each other through various means, but, for the most part, the idea is that they are designed as independent units that can be plucked out of a project and plugged into another (in theory).
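For non-Django readers, a rough sketch of what that separation looks like in practice; the app names (dashboard, device_config) are invented, and these are fragments of several files rather than one runnable script.

```python
# project/settings.py -- each "page" is its own self-contained app.
INSTALLED_APPS = [
    "django.contrib.staticfiles",
    "dashboard",
    "device_config",
]

# project/urls.py -- the project only wires the apps together.
from django.urls import include, path

urlpatterns = [
    path("dashboard/", include("dashboard.urls")),
    path("config/", include("device_config.urls")),
]

# dashboard/urls.py -- app-specific routes; app-specific CSS/JS would live
# under dashboard/static/dashboard/ rather than in one project-wide file.
from django.urls import path
from . import views

urlpatterns = [
    path("", views.index, name="dashboard-index"),
]
```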
The other thing I've been doing is using ChatGPT and Cursor simultaneously. While Cursor is working I work with ChatGPT on the browser to plan the next steps, evaluate options (libraries, implementation, etc.) and even create quick stand-alone single file HTML tests I can run without having to plug into the Django project to test ideas. I like this very much. It works well for me. It allows me to explore ideas and options in the context of an OpenAI project and test things without the potential to confuse Cursor. I have been trying to limit Cursor to being a programmer, rather than having long exploratory conversations.
Based on this experience, one thing is very clear to me: If you don't know what you are doing, you are screwed. While the OpenAI demo where they have v5 develop a French language teaching app is cool and great, I cannot see people who don't know how to code producing anything that would be safe to bet the farm on. The code can be great and it can also be horrific. It can be well designed and it can be something that would cause you to fail your final exams in a software engineering course. There's great variability and you have to get your hands in there, understand and edit code by hand as part of the process.
Overall, I do like what I am seeing. Anyone who has done non-trivial projects in Django knows that there's a lot of busy boilerplate typing that is just a pain in the ass. With Cursor, that evaporates and you can focus on where the real value lies: The problem you are trying to solve.
I jump into the embedded C++ project next week. I've already done some of it, but I'm in that mental space 100% next week. Looking forward to new discoveries.
The other reality is simple: This is the worst this will ever be. And it is already pretty good.
Years ago I gave up compiling these large applications altogether. I compiled Firefox via FreeBSD's (v8.x) ports system; that alone was a nightmare.
I cannot imagine what it would be like to compile GNOME3 or KDE or Libreoffice. Emacs is the largest thing I compile now.
While a collision hasn't yet been found for a SHA-256 package hash on Nix, by the pigeonhole principle collisions exist, and the computer would not be able to decide between the two packages in such a collision, leading to system-level failure with errors that have no link to their cause (due to the properties involved, and longstanding CS problems in computation).
These things, generally speaking, contain properties of mathematical chaos, a state that is inherently unknowable/unpredictable and that no admin would ever approach or touch because it's unmaintainable. The normally tightly coupled error-handling code is no longer tightly coupled, because it requires matching a determinable state (CS computation problems, halting/decidability).
Non-deterministic failure domains are the most costly problems to solve, because troubleshooting, which leverages properties of determinism, won't work.
This leaves you only a strategy of guess and check; which requires intimate knowledge of the entire system stack without abstractions present.
A cursory look at a Nix system would also show you that the package name, version, and derivation SHA are all concatenated together.
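A loose Python illustration of that point (this is not Nix's actual hashing or path encoding, just the shape of a store path): even if two different derivations somehow hashed identically, the differing name-version suffix would still keep the paths distinct.

```python
# Loose illustration (not Nix's real algorithm) of the store-path layout:
# a truncated hash concatenated with the human-readable name and version.
import hashlib

def fake_store_path(derivation_text: str, name: str, version: str) -> str:
    # Hypothetical helper; real Nix uses its own base-32 encoding and a
    # different hashing scheme over the derivation.
    digest = hashlib.sha256(derivation_text.encode()).hexdigest()[:32]
    return f"/nix/store/{digest}-{name}-{version}"

print(fake_store_path("builder = gcc; src = ...;", "hello", "2.12.1"))
print(fake_store_path("builder = gcc; src = ...;", "firefox", "128.0"))
# Same derivation text gives the same digest, but the differing
# name-version suffix still produces two distinct paths.
```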
> A cursory look at a Nix system would show ... <three things concatenated together>
This doesn't negate or refute the pigeonhole principle. Following the pigeonhole principle, there is some likelihood that a collision will exist, and that probability trends to 1 given sufficient iterations (time).
The only argument you have is a measure of likelihood and probability, which is a streetlight-effect cognitive bias, or intelligence trap. There's a TED video on YouTube from an ex-CIA officer which discusses these types of traps.
Likelihood and probability are heavily influenced by the priors they measure, and without perfect knowledge (which no one has today) those priors may deviate significantly, or be indeterminable.
Imagine for a second that a general method for rapidly predicting collisions, regardless of algorithm, is discovered and released, which may not be far off given current advances in quantum computing.
All the time and money cumulatively spent on Nix becomes wasted cost, and you are suddenly left in a position of complete compromise, without a sound pivot available at a cost comparable to what was spent before.
With respect, if you can't differentiate basic a priori reasoned logic from AI output, I would question your perceptual skills and whether they are degrading. There is a growing body of evidence that exposure to AI may cause such degradation, as various studies are starting to suggest with regard to doctors and their diagnostic performance after using AI tools (1).
1: https://time.com/7309274/ai-lancet-study-artificial-intellig...
It’s still early days, but we are learning that as with software written exclusively by humans, the more specific the specifications are, the more likely the result will be as you intended.
And it’s not a conflict of interest. I’m free to criticize my company if I like.
Perhaps good for someone just getting their feet wet with these computational objects, but not resolving or explaining things in a clear way, or highlighting trends in research and engineering that might point towards ways forward.
You also have a technical writing no-no where you cite a rather precise and specific study with a paraphrase to support your claims … analogous to saying "Gödel's incompleteness theorem means _something something_ about the nature of consciousness".
A phrase like: “Unfortunately, for now, they cannot (beyond a certain complexity) actually understand what is going on” referencing a precise study … is ambiguous and shoddy technical writing — what exactly does the author mean here? It’s vague.
I think it is even worse here because _the original study_ provides task-specific notions of complexity (a critique of the original study! Won’t different representations lead to different complexity scaling behavior? Of course! That’s what software engineering is all about: I need to think at different levels to control my exposure to complexity)