# Hello

Getting the embeddings right. Some things I did wrong

I've been diving deeper into the whole vector embeddings topic.

I've used vector embeddings in a couple of projects. The setup itself is actually quite easy. I tested:

  • Qdrant
  • Pinecone
  • Chroma

My feedback, purely opinionated and not scientific: for both self-hosting and the cloud I can recommend Qdrant. It felt solid to self-host, and their cloud option seemed straightforward. Pinecone is amazing for prototyping when you are building an assistant on top of document data; their AI assistant is the most accessible one I've found on the internet. Chroma and I couldn't become friends, though maybe it was just my setup, and I do love that it's open source.
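For what it's worth, the basic Qdrant setup really is only a handful of lines. Here is a minimal sketch using the qdrant-client Python package; the collection name, vector size, and the toy vectors are placeholders, and in a real setup the vectors would come from your embedding model:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Connect to a local Qdrant instance (e.g. self-hosted via Docker).
client = QdrantClient(url="http://localhost:6333")

# Create a collection; the vector size must match your embedding model's output.
client.create_collection(
    collection_name="docs",  # hypothetical collection name
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Upsert points: each has an id, a vector, and an optional payload (metadata).
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=1, vector=[0.1] * 384, payload={"title": "Example doc"}),
    ],
)

# Search with a query vector of the same dimensionality.
hits = client.search(collection_name="docs", query_vector=[0.1] * 384, limit=3)
print(hits)
```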

Where did this vector embedding idea even click for me (even though it never really sparked into anything back then)?

Well, this one is actually quite fascinating. Back when I finished studying computer science, I made some friends who were all geniuses in their fields. Some of them I appreciated so much that, no matter what, they would get a phone call from me for specific questions. That's still true today, more than six years later. It's always good to see people growing into their potential.

So there was one guy who was somehow the laziest of us, more focused on finding a girlfriend, and he didn't like to work at all. He loved to cook, though. Drinking some nice whiskey and having a good evening with a good steak, that was him. His other side was absolutely fascinating. He was working for a very big corporation that used Google Cloud for everything, and when I asked him about it he said it was totally boring, that he didn't know what to do there; Google Cloud was annoying, mostly complaints. But then he told me: "I just optimized something that used to take an hour so it now runs in a couple of seconds. I turned it into a vector search. I transformed every item into a vector, and that way I could optimize the search," and so on. And I was like, okay, I had done well in the math classes, amazing at derivatives and pretty decent with matrices and transformations. But that he was able to transfer that knowledge into something actually useful in the real world was so amazing that this guy got me looking into the raw materials of what we use now. That was about five years ago.

Fast forward to the last year or so, with this whole AI and embedding explosion, and I keep thinking about that conversation. That friend, seemingly focused on anything but work, had basically shown me a core piece of the future without either of us fully realizing it at the time. He was already using the fundamental idea, representing things as vectors to understand similarity, that powers so much of today's AI, especially RAG systems.
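To make that fundamental idea concrete, here is a tiny sketch with made-up item vectors: similarity between two items is just a comparison between their vectors, typically cosine similarity.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: close to 1.0 = similar direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" for three items (real ones have hundreds of dimensions).
steak   = np.array([0.9, 0.1, 0.0])
bbq     = np.array([0.8, 0.2, 0.1])
invoice = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(steak, bbq))      # high: similar items end up close together
print(cosine_similarity(steak, invoice))  # low: unrelated items point elsewhere
```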

Okay, So Now We Have the Tools, But…

Now, of course, we have these crazy optimized, pre-trained models that make turning anything into vectors seem easy. It is extremely powerful technology, and honestly, if you’re using it, you really should take the time to understand the basics underneath.
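As a sketch of how easy that step looks on the surface (assuming the sentence-transformers package and the all-MiniLM-L6-v2 model, which is just one common small choice, not necessarily the right one for you):

```python
from sentence_transformers import SentenceTransformer

# Any pre-trained embedding model works here; this one is small and widely used.
model = SentenceTransformer("all-MiniLM-L6-v2")

texts = ["Qdrant felt solid for self-hosting.", "The steak was excellent."]
embeddings = model.encode(texts)  # numpy array, one 384-dimensional vector per text

print(embeddings.shape)
```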

But here’s the reality check I ran into: it’s not just plug-and-play. You absolutely cannot just feed everything raw into an embedding model and expect magic. I’ve tried! That’s the fast track to weird results, maybe outright nonsense, or those “hallucinations” everyone talks about. And this is bad, very bad, for the most preached word in the business: scalability. Everyone is investing and jumping on the train because of this one word. Yes, you can scale it. But in reality you cannot scale with just a few lines of code and by drag-and-dropping databases into the embeddings.

This is the point where you should probably take a few steps back. I get it: AI feels like a non-stop race, and you feel pressure to use the latest and greatest right now. But honestly, taking the time to actually research and understand what you’re using, why it works, and how to prepare your data for it? That makes you way more effective in the long run.

Maybe think of it like this: during the gold rush, everyone is digging frantically. Maybe the smarter play isn’t just digging, but building really good shovels – focusing on solid foundations and understanding the tools. Those are the things that remain valuable even when the initial frenzy cools down.

Back to the topic.

You need to prepare and transform the data the right way.

Sure, throwing messy ‘fuzzy’ data in might work sometimes, but it’s not recommended. That’s the quick and dirty stuff we should avoid. Having clean, well-understood data is gold here.
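What “preparing” means depends on the source, but as a minimal sketch (assuming HTML-ish raw documents; the helper name is mine), it is things like stripping markup and normalizing whitespace before anything gets embedded:

```python
import re

def clean_for_embedding(raw: str) -> str:
    """Very small cleanup pass before embedding (hypothetical helper)."""
    text = re.sub(r"<[^>]+>", " ", raw)   # drop HTML-ish tags
    text = re.sub(r"\s+", " ", text)      # collapse whitespace and newlines
    return text.strip()

raw = "<p>Order   #123</p>\n<p>Steak, whiskey,   good evening.</p>"
print(clean_for_embedding(raw))  # "Order #123 Steak, whiskey, good evening."
```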

Question for you: do you actually know what to embed? The main text? The metadata?
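There is no universal answer, but one pattern that worked for me, sketched with made-up field names: embed the main text (sometimes with a high-signal metadata field like the title prepended) and keep the rest of the metadata as payload for filtering.

```python
# Hypothetical document with main text plus metadata.
doc = {
    "title": "Q3 sales playbook",
    "body": "How to pitch the new pricing tiers to existing customers...",
    "author": "jane",
    "created_at": "2024-07-01",
}

# Option A: embed only the body, keep everything else as payload for filtering.
embed_input_a = doc["body"]

# Option B: prepend high-signal metadata (the title) so it influences similarity.
embed_input_b = f"{doc['title']}\n\n{doc['body']}"

payload = {k: v for k, v in doc.items() if k != "body"}  # stored alongside the vector
print(embed_input_b)
print(payload)
```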

Which Chunking Strategy Should We Use for Big Docs?

You usually can’t embed a whole massive document at once: models have input limits, and you want specific answers, not a whole chapter. Chunking is crucial because you need the middle ground between enough context and specific answers. Here it is pure experimentation.
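The most basic starting point, as a sketch, is fixed-size chunks with a bit of overlap (the sizes here are arbitrary and should be tuned per corpus):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap between neighbours."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

document = "..." * 1000  # stand-in for a long document
pieces = chunk_text(document, chunk_size=500, overlap=50)
print(len(pieces), len(pieces[0]))
```

In practice, splitting on paragraph or sentence boundaries usually beats raw character counts, but this is the baseline to experiment against.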

How do you know if your fancy embedding setup is any good? You gotta check. Does searching for X actually bring back relevant stuff? Are the top results useful? This isn’t always easy to measure automatically; sometimes it just needs a human to look and say “yep, this makes sense” or “nope, this is garbage”.
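When you do want a number, even a tiny hand-labelled set helps. Here is a sketch of hit-rate-at-k over a few query-to-expected-chunk pairs; the search function is a stand-in for whatever your vector store returns and is purely hypothetical.

```python
# A handful of hand-labelled queries and the chunk id we expect to come back.
labelled = [
    ("how do I reset my password", "chunk_17"),
    ("refund policy for yearly plans", "chunk_42"),
]

def hit_rate_at_k(search_top_k, labelled_pairs, k: int = 5) -> float:
    """Fraction of queries whose expected chunk shows up in the top-k results."""
    hits = 0
    for query, expected_id in labelled_pairs:
        results = search_top_k(query, k)  # -> list of chunk ids, best first
        if expected_id in results[:k]:
            hits += 1
    return hits / len(labelled_pairs)

# Example with a fake search function, just to show the shape of the thing.
fake_search = lambda query, k: ["chunk_17", "chunk_3"][:k]
print(hit_rate_at_k(fake_search, labelled, k=5))  # 0.5: only the first query hits
```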

In my experience it never worked perfectly. So what to do?

Go specific first!

The key that actually helped turn products into genuinely useful tools was the strategy of not trying to scale immediately. First we focus on one category, one very specific part of the data, one question we want to answer well, or one prompt we want to execute to enrich or manipulate the given data the right way.

This will make you understand what context you actually need, and what you need to provide in which form.

Then, step by step, you create the tools and systems to make your agent grow. Later on you may have some duplicated code, but that’s a price worth paying for good results; you can refactor later, when you get more budget and want true scalability.

Why Bother With All This?

It sounds like a lot of work, right? And it can be. Which brings me back to my point about internal tools. Starting there often makes sense. Like the sales tool example – you know the users, you know the data (mostly), you can calculate the cost vs the benefit of making those salespeople faster and smarter. You’re not exposing a potentially flaky or expensive system to the wild web immediately.

Getting embeddings right means your AI search doesn’t suck. It means your RAG system can actually find the right context to give useful answers. It’s the foundation. And like any foundation, if it’s shaky, the whole thing built on top is gonna have problems. So yeah, worth spending some time thinking about these “best practices”, even if the perfect answer is often “it depends”. What’s your experience been?