I’d love if Firefox would feed the text content of each website I visit locally ...

bravura · on May 16, 2024

You could use archivebox with the archivebox web extension. And then use a separate offline / batch process to embed and RAG your archive.

jnnnthnn · on May 16, 2024

You might not want them to have that information, but I think Google's history search now supports that for Chrome users: https://myactivity.google.com/myactivity

klavinski · on May 17, 2024

I have built a Chrome extension to do this one year ago: [0]

Here is the list of technological problems:

1. When is a page ready to be indexed? Many websites are dynamic.

2. How to find the relevant content? (To avoid indexing noise)

3. How to keep an acceptable performance? Computing embeddings on each page is enough to transform a laptop into a small helicopter with its fans. (I used 384 as the embedding dimension. Below, too imprecise; above, too compute-heavy).

4. How to chunk a page? It is not enough to split the content into sentences. You must add context to them.

5. How to rank the results of a search? PageRank is not applicable here.

[0] https://www.youtube.com/watch?v=GYwJu5Kv-rA

rdli · on May 16, 2024

I'm working on something like this! It's simple in concept, but there are lots of fiddly bits. A big one is performance (at least, without spending $$$$$ on GPUs.) I haven't found that much in terms of how to tune/deploy LLMs on commodity cloud hardware, which is what I'm trying this out on.

leobg · on May 16, 2024

You can use ONXX versions of embedding models. Those run faster on CPU.

Also, don’t discount plain old BM25 and fastText. For many queries, keyword or bag-of-words based search works just as well as fancy 1536 dim vectors.

You can also do things like tokenize your text using the tokenizer that GPT-4 uses (via tiktoken for instance) and then index those tokens instead of words in BM25.

rdli · on May 16, 2024

Thanks! I should have been clearer -- embeddings are pretty fast (relatively) -- it's inference that's slow (I'm at 5 tokens/second on AKS).

jnnnthnn · on May 16, 2024

Could you sidestep inference altogether? Just return the top N results by cosine similarity (or full text search) and let the user find what they need?

https://ollama.com models also works really well on most modern hardware

rdli · on May 16, 2024

I'm running ollama, but it's still slow (it's actually quite fast on my M2). My working theory is that with standard cloud VMs, memory <-> CPU bandwidth is an issue. I'm looking into vLLM.

And as to sidestepping inference, I can totally do that. But I think it's so much better to be able to ask the LLM a question, run a vector similarity search to pull relevant content, and then have the LLM summarize this all in a way that answers my question.

jnnnthnn · on May 16, 2024

Oh yeah! What I meant is having Ollama run on the user's machine. Might not work for the use case you're trying to build for though :)

pizza · on May 16, 2024

This style of embeddings could be quite lightweight/cheap/efficient https://github.com/cohere-ai/BinaryVectorDB

Tostino · on May 16, 2024

Embedding models are generally lightweight enough to run on CPU, can be done in the background while the user isn't using their device.

stavros · on May 16, 2024

https://historio.us

2024throwaway · on May 16, 2024

The OP asked for the data to be stored locally. You linked to a hosted service with a subscription model. Very much not the same.

bravura · on May 16, 2024

I've been curious about historious for a while, but I want to ask how it's different from pinboard.

From what I can see, they both have the ability to archive / FTS your bookmarks.

But in terms of API access, historious only allows WRITE access (ugh), where at least pinboard allows read/write.

What else am I missing?

stavros · on May 16, 2024

historious indexes everything, whereas Pinboard (as far as I know) only indexes things you select. I haven't used Pinboard much, so I can't really say much.