When working on data science tasks, one fundamental pipeline to set up is the one for data collection. Real-world Machine Learning mainly differs from Kaggle-like problems because the data is not static. We need to scrape websites, gather data from APIs, and so on. This way of collecting data might look chaotic, and it is! That's why we need to structure our code following best practices to bring some kind of order to all this mess.
Once you have identified the sources from which you want to gather your data, you need to collect it in a structured way so that you can store it in your database. For example, you might decide that in order to train your LLM, what you need are data sources that contain 3 fields: author, content, and link.
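To make the three-field structure concrete, here is a minimal sketch of how a collected record could be represented before it reaches the database. The class name `Document` and the helper `from_raw` are hypothetical names chosen for illustration, and the assumption that raw records arrive as dictionaries with string values is mine, not something the pipeline requires.

```python
from dataclasses import dataclass, asdict

# The three fields named in the text; everything else here is illustrative.
REQUIRED_FIELDS = ("author", "content", "link")


@dataclass(frozen=True)
class Document:
    """One collected record with the three fields we decided to keep."""
    author: str
    content: str
    link: str


def from_raw(raw: dict) -> Document:
    """Build a Document from a raw scraped record.

    Raises ValueError if any required field is missing or empty, so that
    incomplete records are rejected before they are stored.
    """
    missing = [name for name in REQUIRED_FIELDS if not raw.get(name)]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Extra keys in the raw record are ignored; whitespace is trimmed.
    return Document(**{name: raw[name].strip() for name in REQUIRED_FIELDS})
```

Validating at this boundary means every source, whether a scraper or an API client, hands the rest of the pipeline the same shape of data.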
What you could do is download the data, and then write SQL queries to store and retrieve data from your database. More generally, you might want to implement all the queries needed to perform CRUD operations. CRUD stands for create, read, update, and delete. These are the four basic functions of persistent storage.
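As a rough sketch of what hand-written CRUD queries could look like, the example below uses Python's built-in `sqlite3` module. The choice of SQLite, the table name `documents`, and all function names are assumptions made for illustration; the same four operations apply to any SQL database.

```python
import sqlite3


def connect(path=":memory:"):
    """Open a connection and make sure the (hypothetical) documents table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "author TEXT NOT NULL, "
        "content TEXT NOT NULL, "
        "link TEXT NOT NULL)"
    )
    return conn


def create_document(conn, author, content, link):
    """Create: insert a row and return its generated id."""
    cur = conn.execute(
        "INSERT INTO documents (author, content, link) VALUES (?, ?, ?)",
        (author, content, link),
    )
    conn.commit()
    return cur.lastrowid


def read_document(conn, doc_id):
    """Read: return (author, content, link), or None if the id does not exist."""
    return conn.execute(
        "SELECT author, content, link FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()


def update_content(conn, doc_id, content):
    """Update: replace the content of one row; returns the number of rows changed."""
    cur = conn.execute(
        "UPDATE documents SET content = ? WHERE id = ?", (content, doc_id)
    )
    conn.commit()
    return cur.rowcount


def delete_document(conn, doc_id):
    """Delete: remove one row; returns the number of rows removed."""
    cur = conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
    conn.commit()
    return cur.rowcount
```

Note the `?` placeholders: passing values as parameters rather than formatting them into the SQL string avoids SQL injection, which matters when the content comes from scraped web pages.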