{"id":230534,"date":"2024-01-11T17:00:38","date_gmt":"2024-01-11T09:00:38","guid":{"rendered":"https:\/\/magicalbits.net\/?p=230534"},"modified":"2024-01-11T17:00:38","modified_gmt":"2024-01-11T09:00:38","slug":"ask-hn-how-do-i-train-a-custom-llm-chatgpt-on-my-own-documents-in-dec-2023-hacker-news","status":"publish","type":"post","link":"https:\/\/magicalbits.net\/?p=230534","title":{"rendered":"Ask HN: How do I train a custom LLM\/ChatGPT on my own documents in Dec 2023? | Hacker News"},"content":{"rendered":"<blockquote><p>You don&#8217;t train on documents. There are many startups claiming that but they are deliberately using a misleading term because they know that&#8217;s what people are searching for.<\/p>\n<p>You still do RAG. Llamaindex is still the best option that I know of. Most of the startups that have working products are likely using llamaindex. All of the ones that say they are training on documents are actually using RAG.<\/p>\n<p>Test it out. If it really and truly doesn&#8217;t work, search for a script that creates question and answer pairs automatically with gpt-4. Then try using that for qLoRA. I have never heard of anyone successfully using that for a private document knowledgebase though. Only for skills like math, reasoning, Python, etc. I think the issue is that you need a LOT of data and it needs to repeat concepts or any facts you need to learn many, many times in different supporting ways.<\/p>\n<p>What absolutely does not work is trying to just feed a set of documents into fine tuning. I personally have proven that dozens of times because I had a client who is determined to do it. He has been misled.<\/p>\n<p>What it will do is learn the patterns that are in those documents.<\/p><\/blockquote>\n<p>Source: <em><a href=\"https:\/\/news.ycombinator.com\/item?id=38759877\">Ask HN: How do I train a custom LLM\/ChatGPT on my own documents in Dec 2023? 
| Hacker News<\/a><\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>You don&#8217;t train on documents. There are many startups claiming that but they are deliberately using a misleading term because they know that&#8217;s what people are searching for. You still do RAG. Llamaindex is still the best option that I know of. Most of the startups that have working products are likely using llamaindex. All [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"ep_exclude_from_search":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-230534","post","type-post","status-publish","format-standard","hentry","category-uncategorised"],"jetpack_featured_media_url":"","jetpack-related-posts":[],"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/magicalbits.net\/index.php?rest_route=\/wp\/v2\/posts\/230534","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/magicalbits.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/magicalbits.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/magicalbits.net\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/magicalbits.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=230534"}],"version-history":[{"count":1,"href":"https:\/\/magicalbits.net\/index.php?rest_route=\/wp\/v2\/posts\/230534\/revisions"}],"predecessor-version":[{"id":230535,"href":"https:\/\/magicalbits.net\/index.php?rest_route=\/wp\/v2\/posts\/230534\/revisions\/230535"}],"wp:attachment":[{"href":"https:\/\/magicalbits.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=230534"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/magicalbits.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=230534"},{"taxonomy":"pos
t_tag","embeddable":true,"href":"https:\/\/magicalbits.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=230534"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}