Table of contents
Data analysts need access to structured data. Open data formats give analysts the best chance of drawing conclusions about data.
During the course of data warehousing operations, data should be cleaned into a standard format. A standard format is one that will guarantee the highest chance of interoperability with past and future data sets. Through the combination of data sets of high quality, relevance, and accuracy we may draw conclusions valid to an organization.
tl;dr: recommended file formats are SQLite, JSON. Recommended analysis tools are DuckDB and jupyter notebooks.
Relational data
Most data today is stored in relational databases.
all data lives in tables; tables have columns, and rows are addressable with keys; C.R.U.D.; schemas; a textual language to convey these concepts. The language, of course, is SQL1
SQLite is a database file format that is the perfect way to store datasets. It is:
- Structured: data that is stored in SQLite must conform to a tabular format that can be easily manipulated
- Typed: Each cell of data must conform to a specific data type, or else be explicitly marked as
TEXT
or a binaryBLOB
. - Maintained: SQLite's on-disk file format and C-language API are fully backwards compatiable, documented, and supported until 20502
- Portable: SQLite files are identical bit-for-bit on all platforms2
References
Ben Johnson, I'm All-In on Server-Side SQLite, Version 2022-05-09, [Online]
SQLite Developers, Long Term Support, Accessed 2024-02-30, [Online]
WikiData Contributors, Wikibase/DataModel/Primer, Version 2024-02-18, [Online]