You don’t train on documents. Many startups claim to, but they are deliberately using a misleading term because they know that’s what people are searching for.

You still do RAG. LlamaIndex is still the best option that I know of. Most of the startups that have working products are likely using LlamaIndex. All of the ones that say they are training on documents are actually using RAG.
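For anyone unsure what RAG means in practice, here is a minimal sketch of the idea, not LlamaIndex's API and not how any particular product does it. It assumes a toy bag-of-words similarity in place of real embeddings, and the names `retrieve` and `build_prompt` and the sample chunks are made up for illustration. The point is only that the documents go into the prompt at query time and the model's weights are never changed.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split into alphanumeric word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, chunks, top_k=2):
    # Rank document chunks by similarity to the question (hypothetical helper).
    q = Counter(tokenize(question))
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True)
    return ranked[:top_k]

def build_prompt(question, chunks, top_k=2):
    # Paste the retrieved chunks into the prompt that would be sent to the LLM.
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Made-up example chunks standing in for a private document set.
chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Support is available Monday through Friday.",
    "The API rate limit is 100 requests per minute.",
]

print(build_prompt("What is the API rate limit?", chunks, top_k=1))
```

A real setup would swap the word-count similarity for an embedding model and a vector store, which is the part libraries like LlamaIndex handle for you.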

Test it out. If it really and truly doesn’t work, search for a script that creates question and answer pairs automatically with GPT-4. Then try using those pairs for QLoRA. I have never heard of anyone successfully using that for a private document knowledge base though. Only for skills like math, reasoning, Python, etc. I think the issue is that you need a LOT of data, and it needs to repeat any concepts or facts you want the model to learn many, many times in different supporting ways.
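Roughly, such a script would look like the sketch below. This is an assumption about the general shape, not a specific script I can point to. `ask_model` is a hypothetical callable standing in for whatever GPT-4 client you use, and it is stubbed out here so the example runs offline. The `instruction`/`output` field names are also just one common convention, so check what your QLoRA training script expects.

```python
import json

def make_qa_records(chunks, ask_model, variants=3):
    # For each chunk, request several differently phrased Q&A pairs so the
    # same fact shows up many times in the training data.
    records = []
    for chunk in chunks:
        for i in range(variants):
            prompt = (
                f"Write question/answer pair #{i + 1} about the text below. "
                "Reply as JSON with keys 'question' and 'answer'.\n\n" + chunk
            )
            pair = json.loads(ask_model(prompt))
            records.append({"instruction": pair["question"], "output": pair["answer"]})
    return records

def to_jsonl(records):
    # One JSON object per line, a format many fine-tuning scripts accept.
    return "\n".join(json.dumps(r) for r in records)

def fake_ask_model(prompt):
    # Stub for illustration only; a real version would call an LLM API.
    return json.dumps({"question": "What is the refund window?", "answer": "14 days."})

records = make_qa_records(["Refunds are issued within 14 days of purchase."], fake_ask_model)
print(to_jsonl(records))
```

With a real model behind `ask_model` you would also want to handle replies that are not valid JSON, since `json.loads` will raise on them.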

What absolutely does not work is trying to just feed a set of documents into fine-tuning. I have personally proven that dozens of times because I had a client who was determined to do it. He had been misled.

What fine-tuning will do is learn the patterns that are in those documents, such as their style and phrasing, rather than the facts themselves.

Source: Ask HN: How do I train a custom LLM/ChatGPT on my own documents in Dec 2023? | Hacker News