The Structured Query Language (SQL) has existed since the 1970s and was first standardized around 1986 by the American National Standards Institute (ANSI). Its purpose was to provide a human-understandable language for querying data in tables of database management systems; SQL is thus a domain-specific language. Much later it was also adopted for querying data streams, which can be virtually represented as tables consisting of sliding data windows. Various adaptations to other types of databases (e.g. graph databases) exist as well.
Dataframes first became popular with statistical programming languages (e.g. R), mainly starting in the 1990s, and gained further popularity in the early 2000s through their adoption in other languages, such as Python with the Pandas library in 2009.
While SQL has over time become a powerful construct for very advanced queries, it has never become as powerful as dataframes – especially for expressing complex data flows. The simple but powerful difference is that dataframes can express dependencies between data queries and transformations, while this is not possible in SQL. This means they allow the definition of data flows without introducing artificial control-flow dependencies and their orchestration.
Thus, dataframes reduce complexity and enable powerful concepts, as I will explain later. I will also discuss the confusion that arises because both are often used together.
Finally, I will discuss the future of dataframes – especially making them generalisable as well as independent of a specific programming language or library.
SQL was originally, and still is, based on relational algebra, which uses algebraic structures for modeling and querying data in a non-ambiguous way. Originally the main focus was retrieving and storing data, but over time more aspects came into play, such as security, structural aspects of data or ensuring data consistency. For example, people commonly distinguish between:
- Data Definition Language (DDL): Defining the structure of data (e.g. CREATE TABLE …)
- Data Query Language (DQL): Querying data (e.g. SELECT …)
- Data Manipulation Language (DML): Manipulating data (e.g. INSERT, UPDATE, DELETE …)
- Data Control Language (DCL): Defining access rights to data (e.g. GRANT, REVOKE…)
There are of course further aspects, e.g. defining transactions (multiple statements that need to succeed or fail as a unit, e.g. BEGIN, COMMIT, ROLLBACK, END …).
In SQL one declaratively expresses what should be achieved instead of how it should be achieved. Data dependencies can be expressed explicitly (e.g. "JOIN"), but more often they exist implicitly (e.g. "PRIMARY KEY", "FOREIGN KEY", "CHECK"). While this is generally a good idea for data processing (you just describe what the result should be and not how the database computes it), practice has shown that it is not sufficient. I will explain more on this later.
SQL has been standardized, but there are multiple versions of the standard, which also mix in non-relational aspects (e.g. property graphs) and technology-specific aspects (e.g. querying JSON data), and they never covered all possible variants – leading to dialects and new constructs. There is also no testbed to verify that a software product meets the criteria of a certain SQL standard. This means all compatibility claims by vendors are self-proclaimed and cannot be objectively verified.
Additionally, within SQL one can usually use custom functions, implemented in any (programming) language in a non-standardized way, that provide functionality not yet – or never – part of a standard (e.g. geospatial operations). This introduced a different paradigm to SQL compared to the aforementioned declarative approach: one can now add imperative (control-flow) elements to SQL (although largely in a non-standardized fashion). However, as we will see later, the imperative (control-flow) approach is also not sufficient, as it does not take dependencies between data into account and can contradict the declarative approach, leading to inconsistency and complexity.
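To make this concrete, here is a minimal sketch of such a vendor-specific extension, using SQLite's ability to register a custom function written in Python. The function name distance_km and the haversine logic are purely illustrative; no SQL standard defines them, which is exactly the point.

```python
import math
import sqlite3

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call it.
conn.create_function("distance_km", 4, distance_km)
row = conn.execute(
    "SELECT distance_km(52.52, 13.405, 48.137, 11.575)"  # Berlin -> Munich
).fetchone()
print(round(row[0]), "km")  # roughly 500 km
```

Once registered, the function behaves like any built-in, but the same statement will not run on another engine – the extension mechanism, not the function, is what SQL leaves unstandardized.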
Due to (historic) technological constraints a lot of things have been stuffed into SQL (including its standards), leading to unnecessarily high complexity when dealing with complex data flows. SQL itself was never designed to articulate complex data flows, but simple "transactional-like" querying and manipulation. This means each data flow (query) exists in isolation, without knowing about the others.
Usually external software is then used to "orchestrate" the execution of different queries in a control-flow order. The problem is that data dependencies cannot be expressed explicitly in control flows – leading to complex control flows no one understands, inefficient optimisations, data inconsistencies and a lot of bugs.
Dataframe languages are "younger" than SQL, but they are still a couple of decades old. They have been used for years in many different programming languages for modeling and implementing complex data flows – mostly analytical ones. However, nothing prevents using them for all kinds of data flows.
In dataframe languages you specify one or more dataframes that refer to the underlying data, and you can perform operations on them to query or manipulate the data. Contrary to SQL, you do not need to specify tables/structures in advance; you can do this when you define and run the script in a dataframe language.
Execution of a set of operations on dataframes can be delayed until the results are needed (e.g. when showing them to the user). This has the advantage that many operations on different dataframes can be optimized much further, as more information is available about what the user really needs. This is not possible in SQL – you either execute the full query or not.
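The mechanism behind this can be sketched in a few lines of plain Python. The LazyFrame class below is hypothetical (not a real library): operations only record what should happen, and nothing is computed until collect() is called, so an engine could first optimise the accumulated plan as a whole.

```python
class LazyFrame:
    """Toy illustration of lazy evaluation: operations build a plan."""

    def __init__(self, rows, plan=None):
        self._rows = rows
        self._plan = plan or []  # recorded operations, not yet executed

    def filter(self, predicate):
        return LazyFrame(self._rows, self._plan + [("filter", predicate)])

    def select(self, *columns):
        return LazyFrame(self._rows, self._plan + [("select", columns)])

    def collect(self):
        # Only here is work actually done; a real engine would first
        # rewrite/optimise the whole plan (e.g. push filters down).
        rows = self._rows
        for op, arg in self._plan:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            elif op == "select":
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows

lf = LazyFrame([{"id": 1, "price": 10}, {"id": 2, "price": 200}])
result = lf.filter(lambda r: r["price"] > 100).select("id")  # nothing runs yet
print(result.collect())  # [{'id': 2}]
```

Real engines such as Polars or Spark work on the same principle, only with a far richer set of operations and optimisation rules.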
Furthermore, this approach also allows you to structure and modularise dataflows expressed in dataframe languages much more easily and in many different ways. Thus they can be adapted more easily to different needs.
Hence, although dataframes have some similarity with SQL, they are a much more expressive way to describe data flows.
An example for a data flow articulated in a dataframe language could be:
df_facts := load_iceberg("/facts.iceberg/")
df_companies := load_iceberg("/companies.iceberg/")
df_investment_amounts := load_iceberg("/investments.iceberg/")
df_interests := load_iceberg("/interests.iceberg/")
df_investment_returns := df_investment_amounts * df_interests
df_projected_returns := df_investment_returns
df_projected_returns["year"] := df_projected_returns["year"] + 1
store_iceberg(df_projected_returns, "/projected_returns.iceberg/")
This simple example loads multiple dataframes with different schemas from files (in Apache Iceberg format). It then multiplies the investment amounts by the current interest rates for the investments and creates a new dataframe that is then stored as files.
This is a very simple example and does not really leverage any advanced dataframe language features. Nevertheless, writing the same in SQL is much more verbose, and it is difficult to add further dataflows, e.g. ad hoc. Furthermore, finding issues in the dataflow is much easier, as one can simply go to the statement where the issue occurs. Doing this in a complex nested SQL query is much more challenging. On several occasions I have seen complex dataflows written in SQL (with single SQL statements of hundreds to thousands of lines), and those were usually more costly to maintain, optimize and debug.
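The pseudocode above can be sketched in pandas as follows. In-memory data stands in for the Iceberg files, and the column names ("amount", "interest_rate", "year") are illustrative assumptions:

```python
import pandas as pd

# Stand-ins for the Iceberg files in the pseudocode example.
df_investment_amounts = pd.DataFrame(
    {"investment_id": [1, 2], "amount": [1000.0, 5000.0], "year": [2024, 2024]}
)
df_interests = pd.DataFrame(
    {"investment_id": [1, 2], "interest_rate": [0.03, 0.05]}
)

# Multiply amounts by interest rates (joined on the investment id).
df_investment_returns = df_investment_amounts.merge(df_interests, on="investment_id")
df_investment_returns["return"] = (
    df_investment_returns["amount"] * df_investment_returns["interest_rate"]
)

# Project the returns one year ahead.
df_projected_returns = df_investment_returns.copy()
df_projected_returns["year"] = df_projected_returns["year"] + 1
print(df_projected_returns[["investment_id", "return", "year"]])
```

A real deployment would replace the in-memory frames with reads from Iceberg (or Parquet) and end with a write, but the dependency structure of the flow is identical.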
You may think dataframes are simply like programming, but there is a difference: there is no control flow. The statements are declarative, i.e. they do not need to be executed one after the other. Indeed, we will see that in many engines supporting dataframe languages, execution is delayed as much as possible (lazy evaluation) and the statements are often executed in parallel and/or in a different order for optimal performance. Nevertheless, the data dependencies are still taken into account. For instance, in the pseudocode example above you cannot multiply amounts and interest rates before the data is available, but the projection of the "year" column can run in parallel to independent calculations.
This feature of dataframe languages also allows changing the dataflow during execution (e.g. for interactive development), still taking the data dependencies into account.
In summary, dataflows expressed in a dataframe language are easy to read, their data dependencies are made explicit, and engines interpreting them can execute them in parallel and/or in a different order for high performance.
A Case Study
The following diagram shows the data flow of our case study. In the subsequent sections I will use SQL and a dataframe language to describe this dataflow. Additionally, I will highlight advantages and disadvantages of both for implementing it.
The case study is about preparing payment data from various countries for analytics by end users. I simplified this case significantly; in reality it is much more complex. Essentially, the following things are done:
- Validating country-specific data
- Validating the country-specific data at a global level, also taking into account counterparty reference data (e.g. company names, location etc.)
- Checking the plausibility of payment data (e.g. negative payments, unusually large payments for a given transaction etc.)
- Creating country-specific data marts for analysis by country subsidiaries
- Creating a global data mart with time series and aggregations of payment data at a global level
Important: Data flows do NOT express control flows, but data dependencies. This means data processing for a specific data product (e.g. a country-specific data mart) can start as soon as its input data is available (e.g. the country-specific data is validated). Expressing the same data flow using a control-flow approach would have led to a more complex model and a much higher design/implementation/maintenance effort.
I will now briefly present how you can model/implement the use case using SQL and afterwards using a dataframe language. In both cases I use no specific SQL dialect or dataframe language, but a simplification.
I will focus here only on creating the data product "country-specific validated payment data". This assumes that the payment data for each country has already been imported somehow into one or more tables that can be queried by an SQL statement validating the "country-specific payment data".
This is of course an extreme simplification of the use case, but it already shows the issues, which only become worse the more complex your data flow is.
Let us now assume a very simplistic table with payment data (ignoring any other tables you would need to join) for a payment of a "good" that is bought with a loan:
id, id_of_good, id_of_loan, id_of_collateral, purchase_date, price, currency, vat_tax_rate, payment_date, shipment_date, country.
An example data point for this table could be (illustrative values): (p_001, g_01, l_042, c_007, 2024-03-01, 499.99, EUR, 21, 2024-03-05, 2024-03-07, NL).
Now a lot of things could go wrong with this data, e.g. an id in the wrong format, a wrong id of the good, a price in the wrong format or simply wrong, a date in the wrong format, a shipment on a Sunday that is not possible with a specific shipment provider, a wrong/not supported/unknown currency, a country that is wrong or not supported for shipment, etc.
I will now construct a SQL query showing only a subset of validation rules for this. The SQL query produces as output a table with the results of the validation (again simplified) with the following columns: validation_rule_id, id (of the data point), validation_result_msg.
Find here an example of how such a table could look like:
validation_rule_id | id | validation_result_msg
v_01_1234 | … | Invalid value added tax-rate "1"
Let's see how this use case can be realized in a simplified version using SQL and afterwards using a dataframe language.
Realisation in SQL
One cannot express the whole dataflow as a single SQL statement, as a single SQL statement has a single table as output. However, the dataflow above has multiple tables as output (e.g. "country-specific validated data", "globally validated data", "payment datamart global"). This means one needs to create a dedicated SQL statement for each data product/table output.
The following example shows how such a simplified SQL statement with two validation rules could look like:
SELECT 'v_01_01' AS validation_rule_id, id,
       concat('Invalid good_id: ', id_of_good) AS validation_result_msg
FROM tbl_raw_payment_data
WHERE id_of_good NOT IN ('g_01', 'g_02', 'g_03', 'g_04', 'g_05')
UNION ALL
SELECT 'v_01_1234' AS validation_rule_id, id,
       concat('Invalid value added tax-rate: ', vat_tax_rate) AS validation_result_msg
FROM tbl_raw_payment_data
WHERE vat_tax_rate NOT IN (7, 9, 21)
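This statement can be run end-to-end, for example on SQLite via Python. The sample rows below are illustrative, the table is trimmed to the columns the rules need, and SQLite's concatenation operator || stands in for concat():

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tbl_raw_payment_data (
    id TEXT, id_of_good TEXT, vat_tax_rate INTEGER)""")
conn.executemany(
    "INSERT INTO tbl_raw_payment_data VALUES (?, ?, ?)",
    [("p_1", "g_01", 21),   # valid row
     ("p_2", "g_99", 21),   # invalid good id
     ("p_3", "g_02", 15)])  # invalid VAT rate

# Both validation rules combined with UNION ALL into one statement.
rows = conn.execute("""
    SELECT 'v_01_01' AS validation_rule_id, id,
           'Invalid good_id: ' || id_of_good AS validation_result_msg
    FROM tbl_raw_payment_data
    WHERE id_of_good NOT IN ('g_01','g_02','g_03','g_04','g_05')
    UNION ALL
    SELECT 'v_01_1234', id, 'Invalid value added tax-rate: ' || vat_tax_rate
    FROM tbl_raw_payment_data
    WHERE vat_tax_rate NOT IN (7, 9, 21)
""").fetchall()
print(rows)
```

Note how each additional validation rule grows this single statement; with dozens of rules the query quickly becomes the hard-to-debug monolith described above.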
Another SQL statement is needed to calculate how many payments have been made per good. This can be done as follows:
SELECT 'agg_01_01' AS aggregation_rule_id, id_of_good, count(id_of_good) AS payment_count
FROM tbl_raw_payment_data
GROUP BY id_of_good
Since this depends on the validation SQL statement, one needs to ensure that the validation of the good identifier (id_of_good) has been executed before. Whether the vat_tax_rate is valid does not matter for the aggregation. However, since SQL cannot express such data dependencies, one needs to establish a control-flow dependency between both statements. This has the issue that all other validations are executed beforehand as well – even if they are irrelevant for the aggregation. This unnecessarily hinders parallelism and thus prolongs the dataflow execution significantly (especially the more such dependencies there are). It can also have a negative impact on error handling, as even correct calculations would need to be repeated (which can be very problematic in complex control flows).
Also, the control-flow model usually becomes more complex, as one needs to artificially translate data dependencies into control-flow dependencies.
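The difference can be made visible with Python's standard library: when data dependencies are explicit, a scheduler can derive which steps may run in parallel. The task names below mirror the example and are illustrative:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks whose outputs it needs.
dependencies = {
    "validate_good_id":   {"load_payments"},
    "validate_vat_rate":  {"load_payments"},
    "aggregate_per_good": {"validate_good_id"},  # does NOT need the VAT check
}
ts = TopologicalSorter(dependencies)
ts.prepare()
waves = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # everything in `ready` can run in parallel
    waves.append(ready)
    ts.done(*ready)
print(waves)
```

A pure control-flow orchestration would typically sequence all validations before the aggregation; the dependency graph shows that aggregate_per_good only has to wait for validate_good_id, and that the two validations are free to run concurrently.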
The following issues can be identified when using SQL:
- One needs a dedicated SQL statement for each table output/data product, which somehow needs to be artificially orchestrated using a control-flow language instead of a dataflow language.
- Data dependencies between different data products created in the dataflow cannot be properly expressed using SQL.
- Heterogeneous tables (e.g. if each country would deliver slightly different payment data due to national regulations) are difficult to integrate and make the SQL queries more complex.
- Bringing several independent SQL statements, such as validation rules, into one table makes the SQL statement very long, complex, unreadable and difficult to debug.
- One cannot define in SQL where to write the result of a query (i.e. the derived data product) – except by previously creating the table using a CREATE TABLE statement or inserting into an existing table.
- One still needs some orchestration logic between the SQL statements. This is problematic, as the orchestration logic cannot take data dependencies into account, only control-flow dependencies. For instance, you need to define in another language outside SQL which SQL statements should follow each other, but this does not guarantee that they all have the inputs needed to correctly calculate a new data product.
- A further source of data inconsistencies is that decisions (e.g. if/then) are modeled outside SQL in the control-flow language, and there is no clear guidance on when they should be modelled inside or outside SQL. This leads to very differently designed/implemented data flows that essentially do different things, making maintenance difficult.
- Since no data dependencies are modelled explicitly and the different SQL statements exist in isolation from each other, they cannot be optimized together – leading to much slower data pipelines and much more compute than necessary. In more complex data flows, such optimization is error-prone and difficult to do in control-flow languages, because it has to be designed manually and cannot be done automatically, as the data dependencies are not described in control flows. For instance, in the simplistic example one could already run the global validation and the country-specific validation in parallel without any issues.
Realisation in a Dataframe Language
Contrary to the SQL example, the whole dataflow can be expressed in the dataframe language. One could also decide to modularise the data flow and reuse modules in different data flows.
Find here how the example would look in a pseudo dataframe language:
# First rule
validation_rule_df_01 = tbl_raw_payment_data[~tbl_raw_payment_data["id_of_good"].isin(('g_01', 'g_02', 'g_03', 'g_04', 'g_05'))]
validation_rule_df_01["validation_rule_id"] = 'v_01_01'
validation_rule_df_01["validation_result_msg"] = concat('Invalid good_id: ', validation_rule_df_01["id_of_good"])
# Second rule
validation_rule_df_02 = tbl_raw_payment_data[~tbl_raw_payment_data["vat_tax_rate"].isin((7, 9, 21))]
validation_rule_df_02["validation_rule_id"] = 'v_01_1234'
validation_rule_df_02["validation_result_msg"] = concat('Invalid value added tax-rate: ', validation_rule_df_02["vat_tax_rate"])
# Add all to final result
validation_results_df = union(validation_rule_df_01, validation_rule_df_02)
# Simple aggregation
aggregation_df = tbl_raw_payment_data.groupby("id_of_good").count()
You can see that it also includes the aggregation. As I will describe later, this can easily be optimized to start immediately when the validation of id_of_good has finished and to run in parallel to other, independent validations. This is possible – despite all instructions being written sequentially – because engines executing dataframe languages can determine, based on the data dependencies, which calculations can run in parallel.
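For reference, the pseudocode translates almost one-to-one into runnable pandas, here with a few illustrative in-memory rows standing in for tbl_raw_payment_data:

```python
import pandas as pd

tbl_raw_payment_data = pd.DataFrame({
    "id": ["p_1", "p_2", "p_3"],
    "id_of_good": ["g_01", "g_99", "g_02"],
    "vat_tax_rate": [21, 21, 15],
})

# First rule: unknown good identifiers
bad_good = tbl_raw_payment_data[
    ~tbl_raw_payment_data["id_of_good"].isin(["g_01", "g_02", "g_03", "g_04", "g_05"])
].copy()
bad_good["validation_rule_id"] = "v_01_01"
bad_good["validation_result_msg"] = "Invalid good_id: " + bad_good["id_of_good"]

# Second rule: unsupported VAT rates
bad_vat = tbl_raw_payment_data[
    ~tbl_raw_payment_data["vat_tax_rate"].isin([7, 9, 21])
].copy()
bad_vat["validation_rule_id"] = "v_01_1234"
bad_vat["validation_result_msg"] = (
    "Invalid value added tax-rate: " + bad_vat["vat_tax_rate"].astype(str)
)

# Add all to final result, plus the simple aggregation
validation_results = pd.concat([bad_good, bad_vat], ignore_index=True)
payments_per_good = tbl_raw_payment_data.groupby("id_of_good")["id"].count()
print(validation_results[["validation_rule_id", "id", "validation_result_msg"]])
print(payments_per_good)
```

Note that payments_per_good only reads id and id_of_good, which is exactly why an engine aware of these dependencies could schedule it independently of the VAT check.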
Although the difference might not look like much, the benefit of a dataframe language is that the whole data flow can be expressed in the same language:
- There is no need to switch to a control-flow language, which would break the dependencies defined in the dataflow.
- You can modularize and structure the dataflow as makes most sense to you. Not everything needs to be one big module, and you can define reusable modules as part of the data flow. This also makes the dataflow easier to understand.
- The dataflow can easily have multiple input and output data products – they do not need to be specified as tables beforehand.
- It can easily be optimized by a dataflow engine (see next section), as the whole dataflow is available in the same language.
Clearly, the example above is simplified, but one can already see its potential for complex dataflows.
Even though powerful hardware and engines now exist that can run dataflows across large clusters, there is still room and need for further optimisations and capabilities. Reasons include, for example, the need to reduce costs or environmental impact.
Examples for these capabilities are:
- Lazy evaluation: Computation happens only when there is a need for the result (e.g. when saving it to disk). Within a dataframe language this can be very late, e.g. at the end of hundreds of calculations storing results into multiple, possibly interdependent data products. This is not possible in SQL, as you can only store the results into one table (data product), and only then can the control flow execute the next statement (manual optimisations are possible but tedious/error-prone in complex dataflows).
- Dataframe languages and SQL can both benefit from optimisations (e.g. in Apache Spark or Polars) such as cost-based optimisation, indexes etc., but the potential for dataframe languages is larger, as there is a perspective on the full dataflow and not only on individual elements (SQL queries).
- Since a dataframe language expresses the dependencies between different calculations and data, one can easily distribute computations to different nodes and manage their utilization much better compared to SQL orchestrated with a control-flow language. This reduces resource needs (e.g. CPU and memory).
- You can also easily combine dataflows, e.g. a dataflow processing private data only on an end-user device and providing aggregated/anonymized data products as input to a larger dataflow running on a large cluster.
- You can do static analysis on a dataflow script, e.g. to suggest optimizations that reduce the script itself to make it more readable, or to ensure correctness in a simple manner using static typing (static-frame).
- You can use formal methods (e.g. model checking) to prove that your dataflow fulfills certain criteria/soundness (example).
- The syntax of dataframe languages can be made intuitive, and representation complexity can be reduced by providing syntactic sugar (example).
Dataframe languages currently mainly have analytical use cases in mind. However, they can also be very useful for online transaction processing (OLTP), which has historically been a domain of SQL – so much so that developers use SQL even for new transactional systems. Nevertheless, the powerful concept of expressing full dataflows is also very relevant in the OLTP world. Transactions could be implemented more easily, and possible conflicts between transactions could be predicted more easily, increasing the performance of large-scale OLTP systems. Unfortunately, not much research and innovation has been done in this area yet (some relational databases add something in this direction though, e.g. Function Pipelines in Postgres).
File Formats for Dataframe Languages
Dataframes can be facilitated and further optimised by utilizing specific file formats. Historically these have been comma-separated values (CSV) files, but CSV has no real definition of different datatypes and stores all data in text format. Thus, running optimised dataflows on them is limited.
With the emergence of Big Data platforms, new file formats appeared on the scene, such as Apache Parquet, Apache ORC and Apache Avro. They provide specific features for working with dataframes, such as skipping the reading of data that is not processed in a dataframe. They also support various other features to facilitate efficient dataframe operations (e.g. min/max indexes, Bloom filters, compression or reading individual columns without reading full rows).
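The effect of reading only the referenced columns can be sketched with pandas. For portability the example uses CSV with the usecols parameter; columnar formats expose the same idea (e.g. read_parquet(columns=...)) and can additionally skip whole row groups using min/max statistics:

```python
import io
import pandas as pd

# A small table with four columns; the dataflow only needs two of them.
csv_data = io.StringIO(
    "id,price,currency,country\n"
    "1,10.0,EUR,NL\n"
    "2,20.0,EUR,DE\n"
)
# Only the requested columns are materialised in the dataframe.
df = pd.read_csv(csv_data, usecols=["id", "price"])
print(list(df.columns))  # ['id', 'price']
```

With a true columnar format the unused columns are never read from disk at all, which is where the real savings for large dataflows come from.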
Several years ago, novel Big Data formats such as Apache Iceberg, Apache Hudi and Delta Lake appeared. They added new capabilities relevant for dataframe languages, such as optimised individual inserts/updates/deletes, large numbers of partitions, in-place schema evolution, optimised placement of data according to needs and many more. These formats are used more and more in production, but are more complex to manage than the previously mentioned formats.
Further optimisations might be possible in future file formats, e.g. spreadsheet-like dataframes or mathematical objects (e.g. tensors).
SQL has rich support across various engines and technologies (e.g. relational databases), which I will not repeat here.
The focus of this section is support for dataframe languages. Often they are tightly coupled with an engine. Examples:
- Python Pandas: Targets small to medium datasets that can be easily processed on a single core
- Python Polars (based on Rust pola.rs): Targets medium to large datasets that can be easily processed on multiple cores
- Python Dask or Python Ray or Apache Spark: Targets medium to large datasets that are processed on a compute cluster
- R dataframes: Dataframes are a core concept in the R programming language
- Julia DataFrames.jl: Dataframes are a core concept in the Julia programming language
Modin for Python also targets medium to large datasets processed on a compute cluster, but has a theoretical foundation facilitating some of the capabilities mentioned before. Furthermore, Modin supports different underlying engines in Python.
The Validation and Transformation Language (VTL) by the Statistical Data and Metadata eXchange (SDMX) initiative is a platform-independent dataframe language that is interpreted by an engine (e.g. Trevas to run it in Spark or in-memory). This is an interesting approach, as it is not bound to a specific engine but allows engine-independent optimisations and analysis. Thus it can also be seen as a generalization of Modin.
Often you also find the possibility to mix the two. This is typically the case when a dataflow needs to access a data source that only supports SQL (e.g. a relational database) and then does more advanced transformations in the dataframe language.
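This mixed pattern can be sketched with SQLite standing in for the SQL-only source: a plain SQL query fetches the raw rows, and the remaining transformations happen in the dataframe world. Table and column names are illustrative:

```python
import sqlite3
import pandas as pd

# An SQL-only data source (here: an in-memory SQLite database).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id TEXT, country TEXT, price REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [("p_1", "NL", 10.0), ("p_2", "NL", 30.0), ("p_3", "DE", 5.0)],
)

# Step 1: SQL is used only to extract the rows.
df = pd.read_sql("SELECT id, country, price FROM payments", conn)

# Step 2: further transformation happens in the dataframe language,
# e.g. totals per country.
totals = df.groupby("country")["price"].sum()
print(totals.to_dict())  # {'DE': 5.0, 'NL': 40.0}
```

Everything after the initial extraction stays inside the dataframe world, so the downstream dependencies remain visible to the engine.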
Reporting/dashboard tools often only support traditional SQL, but on large datasets they show lower performance, as SQL queries are executed anew after every interaction with the user. They also cannot easily combine many different data sources. Here, dataframe languages may show a lot of potential for improvement.
Currently, both query and dataframe languages co-exist and are even used in the same dataflow. This has mostly historical reasons. Dataframe languages are much more powerful for modeling data flows, as they can express data dependencies between data access and manipulation steps. Furthermore, it is also easier for a human to understand and read how data was extracted and manipulated in a dataframe language. SQL requires control-flow orchestration of dataflows, which is inefficient and can lead to inconsistencies, unnecessarily complex flows and data quality issues. Hence, over a longer period of time, dataframe languages may become more successful and widely used compared to SQL (which still has its uses).
However, dataframe languages – despite existing for decades – have by far not reached their full potential yet. Especially tools that have been based on SQL/NoSQL or similar languages for structured data (e.g. dashboards/interactive analytics, professional data production flows, online transaction processing (OLTP) etc.) could benefit from the performance, interactivity, auditability (including data lineage) and soundness features of dataframe languages.
I hope that in the near future we will have a more solid theoretical basis for dataframe languages and how they can ensure consistent data flows while reducing the complexity burden. Surprisingly, the body of research on dataframe languages is rather small compared to SQL, so this would be a first step forward: define, based on existing dataframe languages in practice, what their general properties are, how they can be used to reduce the complexity of expressing data flows, improve the management of data flows (e.g. evolution of data structures) and lead to fewer issues with data. Luckily, this work has already started in recent years.
Finally, platform-independent dataframe languages may prove very useful for running dataflows without changes on many different engines, or possibly running different parts of the same dataflow on different engines to benefit from their features. Here, work has also started, such as Modin or VTL.