There are multiple references about the relational algebra for modeling queries (SELECT
) but I have found very very little on the expanded algebra that would include concepts in all of DML such as INSERT
, UPDATE
, DELETE
and maybe even MERGE
, DECLARE
, SET
(as defined by MS SQL Server, whenever I must choose). I'm looking for information about such an expanded algebra to help me reason about DML operations.
I assume that MS SQL Server, for example, must have some meta-domain model and/or type system that helps it validate SQL Scripts. Here are a few of SQL SERVER 2008's SYS.MESSAGES
(issued for scripted DML operations) that suggest to me violations of a type or domain-rule nature, not syntactic:
286 - The logical tables INSERTED and DELETED cannot be updated. 417 - TOP is not allowed in an UPDATE or DELETE statement against a partitioned view. 10729 - A nested INSERT, UPDATE, DELETE, or MERGE statement is not allowed in a SELECT statement that is not the immediate source of rows for an INSERT statement. 354 - The target '%.*ls' of the INSERT statement cannot be a view or common table expression when the FROM clause contains a nested INSERT, UPDATE, DELETE, or MERGE statement. 4005 - Cannot update columns from more than one underlying table in a single update call.
The database engine is compiling and analyzing the scripted commands it is given and emitting these messages in response to violations against its domain.
An engine executing this following statement isn't merely tacking on something trivial to the join of Table1
and Table2
. Beyond what it needed to SELECT
, for the UPDATE
it has to track more information throughout the statement such as the actual tables that ColB
and ColC
in the resulting join came from so that 1) it knows that they are from the same table (Error #4005) and 2) so it can actually write the results to somewhere permanent (not just a local transformation of the join):
UPDATE Table2 SET ColB = Table2.ColB + Table1.ColB, ColC = 3 FROM Table2 INNER JOIN Table1 ON Table2.ColA = Table1.ColA WHERE Table1.ColX < 100
I've never used jOOQ, but the documentation suggests to me an uncommon level of sophistication/understanding compared to software with similar goals. It apparently includes a fluent-style API for DML, and assuming that some of the library's value is in its ability to detect certain errors at compile-time, this suggests to me a decent domain model of DML. Are jOOQ developers breaking ground with new and proprietary ways of understanding DML operations or is there published research into DML/CRUD types that they're implementing? I'm so far overwhelmed trying to reverse-engineer the domain models from an implementation but I cannot find such a pure reference.
What is the domain model that is behind DML engines like SQL Server's script execution engine or softwares like the jOOQ library enabling them to understand/analyze DML scripts? For sql statements, relational algebra gives mathematical concepts of views, selections, projections, various joins, aggreggations, windowing, grouping, pivoting and each concept has type calculus that it dictates. (Perhaps that a selection over a view yields a view of the same type with equal or fewer elements, that join takes two views and a join condition and yields a view of tupled elements, etc). There are several implementations including Linq-To-SQL, Entity Framework which support (to varying degrees of competency) an alternate encoding and code translation of queries. I assume that those softwares are grounded in the domain model of relational algebra. But what would be the same basis for a library that supports full DML reasoning?
Thanks!
Asked By : Jason Kleban
Answered By : Lukas Eder
Excellent question, and since you referred to us ("jOOQ developers", which I am - working for the company behind jOOQ), I feel qualified to give a partial answer.
A bit of historic context first
Since the very beginning of software, there had been:
- Theory (which is what "Computer Science", i.e. this Stack Exchange subsite is about)
- Practice (more like Stack Overflow, i.e. "how can I get the compiler to stop barking at me")
The origins of set theory, relational algebra, and relational calculus were clearly on the theoretical side. Up to this day, popular RDBMS implementations have only managed to approximate this theory, mostly through SQL.
Although there had been competing languages before SQL was standardised mainly by Oracle and IBM, the power of SQL lay in the fact that it was a very practical language, which worked sufficiently well both for computer programs (e.g. written in COBOL, which was an equally practical language), software engineers, and most importantly: "other people", including analysts, project managers, and also managers. Everyone can quickly write up a SELECT * FROM person WHERE first_name = 'John'
query. More theoretical languages like QUEL, which would have been closer to relational calculus, lacked the business vision that Oracle had.
Since then, SQL has emancipated from the outdated relational model and embraced a lot of alternative models, including procedural extensions (stored procedures), OLAP features, XML, JSON, etc. Not all of these enhancements were very well executed. To this day, the XML integration, for instance, is rather weak, poorly standardised and hard to grasp. I'm just speculating that the ANSI/ISO/IEC SQL standards committee and the w3c didn't work together closely enough to truly embed XQuery into SQL at the time. Unfortunately for w3c, as hardly anyone is using XQuery today, while everyone still uses SQL.
The historic irony here is that a lot of innovation came from academia, and I'd like to specifically mention Michael Stonebraker (Quel, Ingres, the original Postgres, Vertica, H-Store, and much more) who was recently awarded with the well-deserved turing award. Stonebraker has always been innovating into new areas. Most of the recent in-memory column store "hype" (also often referred to as "NewSQL") evolved around him. Stonebraker is a good example of the difference between academia and business. We've blogged about this some time ago here:
- http://blog.jooq.org/2013/08/24/mit-prof-michael-stonebraker-the-traditional-rdbms-wisdom-is-all-wrong (very interesting talk by Stonebraker)
- http://blog.jooq.org/2013/08/27/column-stores-teaching-an-old-elephant-new-tricks
When you hear the excellent talks by Oracle's Andrew Mendelsohn for instance, or other influencers from companies like Microsoft, SAP, IBM, you will notice that business has always embraced SQL for very good reasons, which were often not at all academic.
Context for your question
You have to differentiate between around four levels of abstraction:
- Relational calculus
- The SQL standard
- The SQL Server specification
- The SQL Server implementation
Your example is a particularly interesting one, because UPDATE .. SET .. FROM
is not part of the latest SQL:2011 standard. To our knowledge, the UPDATE .. FROM
clause is supported in at least:
- SAP HANA
- Ingres
- Postgres
- SQL Server
- Sybase
Other databases allow for updating views, e.g.
UPDATE t1 JOIN t2 ON ... SET ...
UPDATE (SELECT * FROM t1 JOIN t2 ON ...) SET ...
In terms of Codd's Rules, the following two would apply:
Rule 6: The view updating rule:
All views that are theoretically updatable must be updatable by the system.
Rule 7: High-level insert, update, and delete:
The system must support set-at-a-time insert, update, and delete operators. This means that data can be retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables. This rule states that insert, update, and delete operations should be supported for any retrievable set rather than just for a single row in a single table.
So, Codd demands that an RDBMS can update views using sets, but he wouldn't want to specify how to actually do that (e.g. syntax-wise).
In fact, the concrete rules which apply to these kinds of statement are rather different from database to database. This has nothing to do with the SQL language, or with relational calculus per se, but simply with the individual implementations of each database - often also because of how the locking / concurrency model is implemented - something that is left open to implementations by both Codd's Rules as well as to some extent by the SQL standard. For instance, the standard does not explicitly refer to MVCC.
In the particular case of UPDATE .. FROM
, think of it this way (assumptions only):
- Some customer probably wanted to be able to write
UPDATE
statements that take into account more than one table - Some database vendor then implemented a "reasonable" syntax that fits into the general feeling of writing SQL
- Some other database vendor then copied this syntax in order to provide migration paths, but "improved" on syntax and/or semantics
- Yet another database vendor (probably IBM or Oracle) came up with yet another, much more powerful syntax for such use-cases, namely
MERGE
As that database vendor has closer ties to the ANSI/ISO/IEC standards committee, and as this syntax was clearly more useful,
MERGE
now became a standard, and is now implemented in more and more RDBMS. To our knowledge in at least:- CUBRID
- DB2
- Firebird (upcoming 3.0 only)
- HSQLDB
- Informix
- Oracle
- SQL Server
- Sybase SQL Anywhere
Answer to your questions
Are jOOQ developers breaking ground with new and proprietary ways of understanding DML operations [...]?
We're not really breaking ground here. Before jOOQ, there had been many many SQL parsers and translators / migrators that allowed for standardising some vendor-specific SQL features. One example is Gudu Software's SQL Parser - the use case being migrating off DB2 towards Oracle, for instance.
What's new in jOOQ is that we introduced another "SQL dialect" implemented directly in Java, which mimicks a bit of the SQL standard and a bit of popular vendor-specific SQL dialects. Since jOOQ users explicitly create an Abstract Syntax Tree in their Java programs by using jOOQ API, jOOQ can easily perform a lot of SQL transformation internally in order to standardise the generated SQL output for all currently supported ~20 RDBMS, without the need to previously parse SQL statements (which is extremely hard). This extends to DML statements only to a certain degree, of course, as the underlying database implementation still needs to be able to process the semantics of a given statement.
What's also new (or at least different) in jOOQ is that unlike HQL, JPQL, LINQ, and many other abstractions, jOOQ is sticking to SQL instead of embedding also other concepts / domains like objects (ORMs) or XML/Collections (LINQ)
Or is there published work on this?
jOOQ is not academic work. It evolved from real user needs over the past 6 years. We're following SQL here as we want to follow real business needs. We're not aware of third-party publications about jOOQ, although there had been a lot of similar papers about other, similar APIs in the past:
- The World According to LINQ (and many other publications) by Erik Meijer
- Embedded Typesafe Domain Specific Languages for Java by Jevgeni Kabanov and Rein Raudjärv
- Database Queries in Java by Christopher Ming-Yee Iu (see also our article)
Embedding an external language like SQL in Java is rather easy, though. The basic technique is published in our article
I'm so far overwhelmed trying to reverse-engineer the domain models from an implementation.
So are we :-)
As you can see, SQL is a very real-world, pragmatic, sometimes (often) a bit ugly, vendor-specific (although well standardised), huge thing. The SQL:2011 standard's main document has almost 1500 pages, including DML (you can find some publicly available draft document versions here, for free). This doesn't even include extensions like XML, the INFORMATION_SCHEMA
or other extensions.
Most people I've talked with in the past would agree that we really need something better than SQL. But the ugly truth is, there is nothing better than SQL right now, and even if there were, it would have to be backwards-compatible. There is no way that our whole industry will stop using the excellent and huge tool-chain that has been built around SQL. If "NoSQL" databases want to be successful these days, they build SQL-esque langauges (e.g. Cassandra with CQL, Couchbase with N1QL, Neo4j with Cypher). Oracle shows that SQL-on-Hadoop (BigDataSQL) is possible, and SAP's HANA completely bypassed NoSQL and implemented their number-crunching in-memory database directly on top of SQL.
The flip-side of this focus on practicality is that the academic quality of something like UPDATE .. FROM
deteriorates to a point where you simply shrug and accept that this is merely a useful, perhaps a bit weird SQL statement.
TL;DR
If you want to understand the individual error messages that your SQL Server instance displays to you, then you're probably best off learning about SQL Server first. These concrete, real-world problems are so far away from the original theory ("Codd's Rules", "Tutorial D") that you'll be much better off bypassing the theory for these cases.
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/43672
0 comments:
Post a Comment
Let us know your responses and feedback