
Generalists Can Also Dig Deep

Ida Silfverskiöld on AI agents, RAG, evals, and what design choice ended up mattering more than expected

Photo courtesy of Ida Silfverskiöld

In the Author Spotlight series, TDS Editors chat with members of our community about their career path in data science and AI, their writing, and their sources of inspiration. Today, we’re thrilled to share our conversation with Ida Silfverskiöld.

Ida is a generalist, educated as an economist and self-taught in software engineering. She has a professional background in product and marketing management, giving her a rare blend of product, marketing, and development skills. Over the past few years, she’s been teaching and building in the LLM, NLP, and computer vision space, digging into areas such as agentic AI, chain‑of‑thought strategies, and the economics of hosting models.


You studied economics, then learned to code and moved through product, growth, and now hands-on AI building. What perspective does that generalist path give you that specialists sometimes miss?

I’m not sure. 

People see generalists as having shallow knowledge, but generalists can also dig deep. 

I see generalists as people with multiple interests and a drive to understand the whole, not just one part. As a generalist you look at the tech, the customer, the data, the market, the cost of the architecture, and so on. It gives you an edge to move across topics and still do good work. 

I’m not saying specialists can’t do this, but generalists tend to adapt faster because they’re used to picking things up quickly.

You’ve been writing a lot about agentic systems lately. When do “agents” actually outperform simpler LLM + RAG patterns, and when are we overcomplicating things?

It depends on the use case, but in general we throw AI into a lot of things that probably don’t need it. If you can control the system programmatically, you should. LLMs are great for translating human language into something a computer can understand, but they also introduce unpredictability.

As for RAG, adding an agent means adding costs, so doing it just for the sake of having an agent isn’t a great idea. You can work around it by using smaller models as routers (but this adds work). I’ve added an agent to a RAG system once because I knew there would be questions about building it out to also “act.” So again, it depends on the use case. 

When you say agentic AI needs “evaluations,” what’s your list of go-to metrics? And how do you decide which ones to use?

I wouldn’t say you always need evals, but companies will ask for them, so it’s good to know what teams measure for product quality. If a product will be used by a lot of people, make sure you have some in place. I did quite a lot of research here to understand the frameworks and metrics that have been defined. 

Generic metrics are probably not enough though. You need a few custom ones for your use case. So the evals differ by application. 

For a coding copilot, you could track what percent of completions a developer accepts (acceptance rate) and whether the full chat reached the goal (completeness).

For commerce agents, you might measure whether the agent picked the right products and whether answers are grounded in the store’s data.

Security- and safety-related metrics are important too, such as bias, toxicity, and how easy it is to break the system (jailbreaks, data leaks).
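To make the first two metrics concrete, here is a minimal sketch of how acceptance rate and completeness could be computed from interaction logs. The log format and field names (`accepted`, `goal_reached`) are illustrative assumptions, not a real framework’s API.

```python
def acceptance_rate(completion_events):
    """Share of suggested code completions the developer accepted."""
    if not completion_events:
        return 0.0
    accepted = sum(1 for e in completion_events if e["accepted"])
    return accepted / len(completion_events)

def completeness(chat_sessions):
    """Share of full chat sessions that reached the user's goal."""
    if not chat_sessions:
        return 0.0
    done = sum(1 for s in chat_sessions if s["goal_reached"])
    return done / len(chat_sessions)

# Hypothetical logged events.
events = [{"accepted": True}, {"accepted": False}, {"accepted": True}]
sessions = [{"goal_reached": True}, {"goal_reached": False}]

print(acceptance_rate(events))  # 2 of 3 suggestions accepted
print(completeness(sessions))   # 1 of 2 chats reached the goal
```

In practice the hard part isn’t the arithmetic but deciding what counts as “accepted” or “goal reached,” which is exactly where the custom, use-case-specific metrics come in.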

For RAG, see my article where I break down the usual metrics. Personally, I have only set up metrics for RAG so far.

It could be interesting to map how different AI apps set up evals in an article. For example, Shopify Sidekick for commerce agents and other tools such as legal research assistants.

In your Agentic RAG Applications article, you built a Slack agent that takes company knowledge into account (with LlamaIndex and Modal). What design choice ended up mattering more than expected? 

The retrieval part is where you’ll get stuck, specifically chunking. When you work with RAG applications, you split the process into two parts. The first is about fetching the correct information, and getting it right matters because you can’t overload an agent with too much irrelevant information. To make retrieval precise, the chunks need to be quite small and relevant to the search query.

However, if you make the chunks too small, you risk giving the LLM too little context. With chunks that are too large, the search system may become imprecise.

I set up a system that chunked based on the type of document, but right now I have an idea for using context expansion after retrieval. 
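A rough sketch of those two ideas, per-document-type chunking and context expansion after retrieval, might look like the following. The chunk sizes, document types, and function names are illustrative assumptions, not the exact setup from the article.

```python
def chunk(text, size):
    """Split text into fixed-size character chunks (simplified)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Smaller chunks for dense content like FAQs, larger for flowing prose.
CHUNK_SIZE_BY_TYPE = {"faq": 200, "report": 800, "chat_log": 300}

def chunk_document(text, doc_type):
    return chunk(text, CHUNK_SIZE_BY_TYPE.get(doc_type, 500))

def expand_context(chunks, hit_index, window=1):
    """After retrieval, pull in neighbouring chunks so the LLM gets
    enough surrounding context without hurting search precision."""
    lo = max(0, hit_index - window)
    hi = min(len(chunks), hit_index + window + 1)
    return "".join(chunks[lo:hi])
```

The appeal of the expansion step is that you get to search over small, precise chunks but still hand the LLM a larger, coherent span of the original document.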

Another design choice you need to keep in mind is that although retrieval often benefits from hybrid search, it may not be enough. Semantic search can connect things that answer the question without using the exact wording, whereas sparse methods can identify exact keywords. But sparse methods like BM25 are token-based by default, so plain BM25 won’t match substrings.

So, if you also want to search for substrings (part of product IDs, that kind of thing), you need to add a search layer that supports partial matches as well.
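The token-versus-substring distinction is easy to demonstrate. The example below is a toy illustration with made-up documents: a real system would use proper BM25 scoring (e.g. the rank_bm25 library) plus an n-gram or trigram index rather than a linear scan.

```python
docs = [
    "Order for product SKU-48213-B shipped yesterday.",
    "Customer asked about SKU-77210-A pricing.",
]

def token_match(query, doc):
    """Token-based sparse methods (like plain BM25) match whole tokens,
    so a partial ID such as '48213' won't hit the token 'SKU-48213-B'."""
    return query.lower() in doc.lower().split()

def substring_match(query, doc):
    """Extra search layer for partial matches (fragments of product IDs)."""
    return query.lower() in doc.lower()

query = "48213"
print([token_match(query, d) for d in docs])      # no whole-token hits
print([substring_match(query, d) for d in docs])  # partial match found
```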

There is more, but I risk this becoming an entire article if I keep going.

Across your consulting projects over the past two years, what problems have come up most often for your clients, and how do you address them? 

The issue I see is that most companies are looking for something custom, which is great for consultants, but building in-house is riddled with complexities, especially for people who haven’t done it before. I saw that 95% number from the MIT study about projects failing, and I’m not surprised. I think consultants should get good at certain use cases where they can quickly implement and tweak the product for clients, having already learnt how to do it. But we’ll see what happens.

You’ve written on TDS about so many different topics. Where do your article ideas come from? Client work, tools you want to try, or your own experiments? And what topic or problem is top of mind for you right now?

A bit of everything, frankly. The articles also help me ground my own knowledge, filling in missing pieces I may not have researched myself yet. Right now I’m researching how smaller models (mid-sized, around 3B–7B) can be used in agent systems, security, and specifically how to improve RAG.

Zooming out: what’s one non-obvious capability teams should cultivate in the next 12–18 months (technical or cultural) to become genuinely AI-productive rather than just AI-busy?

Probably learn to build in the space (especially for business people): just getting an LLM to do something consistently is a way to understand how unpredictable LLMs are. It makes you a bit more humble. 


To learn more about Ida’s work and stay up-to-date with her latest articles, you can follow her on TDS or LinkedIn.

