|
|
Just some thoughts of mine
-
As a concise example of the evils of type coercion I came up with the following: INSERT INTO t VALUES ( CAST(2 AS INT), CAST(2 AS INT)); SELECT CASE WHEN x + z = 4 THEN '4' ELSE 'NOT 4' END xPLUSz FROM t; Result: xPLUSz ------ NOT 4 It is left as an exercise for the reader to work out what types x and z might be!
|
-
The Relational Data Model is 40 years old. Edgar Codd first described the idea that databases could be constructed from mathematical relations in an internal IBM Research report in August 1969. The following year his more famous paper on the same topic was published for public consumption. So began a revolution. Data was liberated from its physical representation and thereafter the most complex data processing problems could be solved using only a few simple building blocks (the eight basic operators of Codd’s relational algebra). Few concepts in the history IT have made such an impact or endured for so long.

At this point I could try to write something profound about what the next 40 years might hold... I’m not foolish enough to predict anything like that far ahead though! For sure there is a lot more that could be done. The SQL systems that most of us use are still some way short of what Codd envisaged even four decades ago. It seems pretty clear that technology already exists to make possible much better database systems than we have today. Solid-state storage for example should make it possible to achieve far greater Physical Data Independence (database representation that is more abstracted from its internal storage structures). Richer type support in databases is something that database vendors have been improving but there is still some way to go – in particular type-inheritance isn’t well-supported by most DBMSs. Self-optimising databases is one area where I expect we’ll see a lot more innovation. And then of course there is ever more scalability and distributed database technology, which brings me to the cloud and distributed databases…
My potentially subversive question is this: Do we actually need new data models for the cloud? It seems that we do need different and hopefully better implementations of database systems than we have in centralised data centres. We will go on using XML stores and other types of storage too. But the relational model still ought to have a central role to play in cloudy, distributed data stores as well as the centralised ones.
Brewer conjectures that we may need to accept partial or eventual consistency of data if we desire a system that’s both available and partitioned. Two observations about this seem important. Firstly, consistency guaranteed at some point will continue to be critical for the vast majority of non-trivial systems. No business is ever giving up consistency and accuracy as desirable goals because trust is the basis of every business model out there. Secondly, eventual consistency means that fewer business rules are implemented in the declarative manner that we are used to, i.e. they are evaluated some time after the event rather than at transaction time.
This could actually suit the relational model very well. Because the RM doesn't have any navigational structures it can easily accommodate data that is late arriving. Provided no constraint is actually broken (as if we aren’t using declarative referential integrity for example) it is perfectly OK for some tuples to arrive “late” in our relational model. We can still query and join on the tuples that are present, we can derive place-holders for tuples that are missing (think outer joins) and we can apply integrity rules to data when it does arrive. This is in stark contrast to navigational models (like XML and other hierarchies) where an element must exist before its child elements can exist.
Another desirable property of large distributed systems is a flexible and dynamic structure. RM is good here too. In principle relational schemas can be highly dynamic but unfortunately this is one area where SQL systems are found wanting. In a relational system, changing the set of attributes of a relation ought to be as simple as writing the query that transforms the data and then assigning the result of that query to a new relation variable. Potentially this needs to involve little or no change to the underlying storage and can therefore be a near-instantaneous metadata operation. SQL DBMSs make extremely heavy work of such a simple task however, which is why SQL databases are often difficult to change and adapt.
Yet another thing on the distributed storage wish-list is rich type support – support for storing documents, images, XML and other objects in databases just as easily as strings and numbers. As already mentioned, SQL systems are doing some of that already and vendors will no doubt continue to improve type support.
The many familiar and valid criticisms of SQL (see for example the “NoSQL” line of thought) are just that – problems with the SQL model and SQL DBMS limitations rather than with the relational model. (See also NoSQL and the Relational Model: don’t throw the baby out with the bathwater). So even after 40 years I still think the relational model is the coolest game in data-land, and perhaps the best is still to come because RM’s potential has never yet been fully realised in mainstream database software. Kudos Mr Codd!

|
-
UPDATE FROM is dead … or it should be. Hugo Kornelis has done a good job of explaining why. Apart from a badly thought out and non-portable syntax, UPDATE FROM suffers a notorious bug/feature. I say “notorious” but it’s probably not notorious enough because it is still common to see people get tripped up by it. If I write: UPDATE Alpha SET z = Beta.z FROM Alpha JOIN Beta ON Alpha.x = Beta.x ; and if Beta.x is not unique then Alpha.z will get updated with some random value from one of the matching rows in Beta.z. This is documented in Books Online as follows: “The results of an UPDATE statement are undefined if the statement includes a FROM clause that is not specified in such a way that only one value is available for each column occurrence that is updated, that is if the UPDATE statement is not deterministic.” MERGE handles the same situation more wisely and safely: it returns a runtime error if the join is non-deterministic: MERGE INTO Alpha USING Beta ON Alpha.x = Beta.x WHEN MATCHED THEN UPDATE SET z = Beta.z ; Msg 8672, Level 16, State 1, Line 42 The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows. MERGE is also standard SQL, works in other DBMSs and frequently results in more efficient execution plans than UPDATE FROM. UPDATE FROM isn’t originally Microsoft’s fault because it’s a legacy of SQL Server’s ancient historical roots at Sybase. But it is Microsoft’s fault they haven’t deprecated it yet. You can vote for deprecating UPDATE FROM on Connect: http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=332437
|
-
DB2 has some new features that are aimed at customers who want to transition from Oracle to DB2: http://www-05.ibm.com/ch/events/symposium/pdf/9_M_Obrecht_IBM_Symposium_2009.pdf IBM are emphasising the improved snapshot isolation, PL/SQL support and packages (“modules” in IBM’s terms). One “feature” that caught my eye is support for implicit conversion. This is a surprise to me. In most areas of software engineering strong typing is considered not just desirable but pretty essential to program correctness and early identification of bugs. I don’t think that reducing or removing type safety is a good way to improve database languages. In SQL Server, Oracle and other DBMSs we already have to live with implicit typing (type coercion) and the problems it causes. SQL Server won’t complain at compile time if I write the following two lines in the same procedure or script: SELECT col1 FROM tbl WHERE col2 = CAST(1 AS INT); SELECT col1 FROM tbl WHERE col2 = CAST('2' AS VARCHAR(10)); The effect is that SQL Server recklessly and silently casts strings as numbers and won’t even give a runtime error until the data itself causes the conversion to break. Getting it right therefore requires much more effort than it would do if types were checked at compile time. Erland Sommarskog has posted a feedback item requesting that stricter compile time checking be incorporated in a future version of SQL Server. Please rate it if you agree: http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=260762
|
-
Whether and when to upgrade to SQL Server 2008? It seems that there is still a school of thought that the release version of a product is not to be trusted and that the best policy is to "wait for SP1". Now this has never made much sense to me, not least because Service Packs can mean quite different things in different products. Sticking with SQL Server 2008 though, I believe there are plenty of reasons to upgrade now and NOT to wait for SP1. Firstly, SQL Server 2008 is an extremely high quality and stable product and has already notched up a great many server-years of collective testing, both before release and after. I personally have been developing and testing with it on projects for a year now and it has been the best quality preview version of the product I've ever worked with. SQL Server 2008 RTM is in very much better shape now than SQL Server 2005 was at this point in its life. Secondly, it has to be said that the history of SQL Server Service Packs is not an entirely happy one. There have been some very unfortunate problems with SQL SPs. For sure, the SQL Server team will learn from their mistakes but the fact is that an SP is never going to get the level of testing at customer sites that a major release gets. This is natural enough because SPs rarely contain new functionality so it's harder to justify the investment needed to build and test with the beta versions of SPs. In terms of quantifiable risk then, installing a 2008 service pack right after release is potentially far more risky than installing SQL Server 2008 RTM right now. Thirdly, there are the real returns you could get from upgrading to 2008 even with minimal effort. Upgrading from 2000 to 2005 was and is a big undertaking. Between 2000 and 2005 there is a whole raft of new features, deprecated features, breaking changes - all of which adds up to a lot of regression testing and potential development effort to take advantage of new features. In comparison, the additional effort to move from 2005 to 2008 or from 2000 to 2008 is a far smaller piece of work. Fewer new things will be broken or need re-working. Plus you can take advantage of new features like compression and policy based management with zero change to existing applications. Andrew Fryer says that you may have to wait a long time for a SQL 2008 SP. At the UK User Group meeting last night he also had a suggestion. Even if you aren't upgrading your servers yet, at least install and use 2008 Management Studio so that you can take advantage of the new features in the tool!
|
-

SQL Server 2008 is out! All editions are on MSDN now except for Express, which should hopefully follow soon. Enjoy!
|
-
The MERGE syntax I used here has changed. In 2008 RC0 the "WHEN SOURCE NOT MATCHED" clause is replaced with "WHEN NOT MATCHED BY SOURCE" which is supposed to make the meaning clearer.
Although MERGE is part of standard SQL 2003 it's worth noting that the SOURCE NOT MATCHED / NOT MATCHED BY SOURCE clause is not part of the standard. It's an extension added by Microsoft - a very useful one in my view.
MERGE INTO a USING b ON a.keycol = b.keycol WHEN MATCHED THEN UPDATE SET col1 = b.col1, col2 = b.col2, col3 = b.col3 WHEN NOT MATCHED THEN INSERT (keycol, col1, col2, col3) VALUES (b.keycol, b.col1, b.col2, b.col3) WHEN NOT MATCHED BY SOURCE THEN DELETE;
|
-
 SQL Server 2008 Release Candidate 0 (build 10.0.1442.32) was released to MSDN and TechNet subscribers on Friday. It should be available to everyone else very soon via the Connect site. http://msdn.microsoft.com/en-us/subscriptions/default.aspx http://technet.microsoft.com/en-us/subscriptions/default.aspx Apart from the new branding (see above) the goodies in this release include some Management Studio enhancements: object search, T-SQL debugger and PowerShell support. Yes, the debugger is back for those who miss it from SQL Server 2000 days! What's also new is that multiple editions are now included in a single package, so you get Express, Standard, Enterprise, etc instead of just Developer. A product key is required to install the paid-for editions. SQL Server Express Edition 64 bit I have seen no info at all from Microsoft about this but RC0 does include a 64 bit version of Express Edition. This is the first native 64 bit release of Express. SQL Server 2005 Express Edition is 32 bit and WOW only. SELECT @@VERSION; -------------------------------------------------------------------------------------------------------- Microsoft SQL Server 2008 (CTP) - 10.0.1442.32 (X64) May 29 2008 23:52:04 Copyright (c) 1988-2007 Microsoft Corporation Express Edition with Advanced Services (64-bit) on Windows NT 5.2 <X64> (Build 3790: Service Pack 2) (1 row(s) affected)
|
-
According to InfoWorld, Bill Gates thinks that declarative modelling of business logic has not caught on because of "weak models". This is a surprise. E.F.Codd's earliest papers on the Relational Model described a relational calculus for the purpose of supporting declarative business logic. Even the SQL standards committee added some powerful features to standard SQL, such as CREATE ASSERTION and generalised CHECK constraints with subqueries. Despite this, InfoWorld identifies the Relational Model as one of the "weak" models referred to by Gates! What is stopping people putting declarative business logic in their data models today? The answer is not the model. It is the DBMS products that are supposed to implement the model. Here's what Terry Halpin said about declarative constraint support in SQL back in 2001 "Although the above SQL syntax [CREATE DOMAIN and CHECK constraint with subquery] is legal as far back as SQL-92, some commercial versions of SQL do not yet support all of this syntax. For data definition, some versions have barely progressed beyond the old SQL-89 standard (which had no domain clauses, and restricted check clauses to conditions on a single row). In practice, some features of a relational schema may need to be specified as a procedural code rather than declaratively."
(Information Modeling and Relational Databases, p411) In 2000, Fabian Pascal wrote: "most commercial DBMSs have not implemented the [SQL] standard's advanced level integrity features. Some implement integrity procedurally via user-written stored procedures expressed in proprietary SQL extensions that can be quite complex, vary across products and have various limitations."(Practical Issues in Database Management, p64) And here's my take on the same problem: http://blogs.conchango.com/davidportas/archive/2007/02/19/Trouble-with-CHECK-Constraints.aspxDatabase practitioners have been saying for years that most SQL DBMSs do a poor job of supporting declarative business logic. Dear Bill: the relational model is just great. Please fix the omissions in the current products before going on to invent whole new ones!
|
-
-
MERGE is a new DML statement in SQL Server 2008. Microsoft have implemented the ISO SQL 2003 and 2007 standard MERGE statement (as seen in Oracle and DB2) and added some extensions of their own. In a nutshell, MERGE allows you to perform simultaneous UPDATE, INSERT and/or DELETE operations on one table. There are new physical operators that combine these operations so that they can be performed in a single scan rather than multiple scans. MERGE has loads of possible applications. For the first time you can assign the contents of one table or query to another in a single operation. The following example requires SQL Server 2008 CTP4. Given this schema and data: CREATE TABLE a (keycol INT PRIMARY KEY, col1 INT NOT NULL, col2 INT NOT NULL, col3 INT NOT NULL); CREATE TABLE b (keycol INT PRIMARY KEY, col1 INT NOT NULL, col2 INT NOT NULL, col3 INT NOT NULL); INSERT INTO a VALUES (1,0,0,0),(2,0,0,0); INSERT INTO b VALUES (1,1,1,1),(3,3,3,3); The following MERGE will populate table a with the same data as table b: MERGE INTO a USING b ON a.keycol = b.keycol WHEN MATCHED THEN UPDATE SET col1 = b.col1, col2 = b.col2, col3 = b.col3 WHEN NOT MATCHED THEN INSERT (keycol, col1, col2, col3) VALUES (b.keycol, b.col1, b.col2, b.col3) WHEN SOURCE NOT MATCHED THEN DELETE; In the relational world this is the operation known as Relational Assignment ie: a := b Unfortunately the SQL syntax is less pretty and requires just a little more typing! MERGE also makes a good "upsert" for application CRUD stored procedures, removing the need for constructs like: IF NOT EXISTS ... INSERT ... Here's an example I created today. It inserts a new Vendor if and only if the name doesn't already exist. Whether the name previously existed or not, it returns the IDENTITY value of the existing or newly inserted row. CREATE PROC dbo.usp_VendorUpsert ( @pVendorID INT OUTPUT, @pVendorName VARCHAR(80) ) AS BEGIN SET NOCOUNT ON; MERGE dbo.Vendor t USING (SELECT @pVendorName ) p(VendorName) ON t.VendorName = @pVendorName WHEN NOT MATCHED THEN INSERT (VendorName) VALUES (@pVendorName) WHEN MATCHED THEN UPDATE SET @pVendorID = VendorID; SET @pVendorID = COALESCE(SCOPE_IDENTITY(),@pVendorID); END RETURN It's amazing that it took nearly 20 years for the SQL standards committee to come up with MERGE. Perhaps the delay is a legacy of the decision to make INSERT, UPDATE and DELETE the basic data update operators. INSERT, UPDATE and DELETE can all be defined as different kinds of relational assignment - assignment being the most basic type of update possible. So arguably MERGE is the more primitive and fundamental data update operator that ought to have been around earlier rather than later.
|
-
According to Computerworld, Mike Stonebraker says "RDBMSes, are 'long in the tooth' and 'should be considered legacy technology'". Knowing Stonebraker's work and reputation that seemed like some pretty startling information! It was less of a surprise to find that Computerworld hadn't quite got it right though... You can read Mike Stonebraker's actual words at www.databasecolumn.com. He makes it clear that he's talking about the technology used for current DBMS implementations, not about the relational model or the SQL model itself. The columnar technology that Stonebraker has championed for years isn't an alternative data model at all; it's a particular mode of storage that in principle could be used by any DBMS, whether relational or not. As far as I know it has been used pretty much exclusively in SQL DBMSs (Sybase IQ for example). Column store is perfect for BI and analytics. What would be even nicer would be to have the choice of column store or row store (or both) in a single DBMS. In fact I'd like to discard altogether the assumption made by most major DBMSs - that a base table automatically defines the main unit of storage. It ought to be possible to map any selection, projection or join of tables into their own storage and to create redundant storage structures where necessary to support the best optimisations for different queries. Today, we have to use three different structures to achieve those things: partitions are a means of storing selections, indexes store projections, and "materialised" views are a slightly roundabout means of persisting joins (why do we have to create a view just in order to change the internal type of storage?). Each of those techniques have their own particular advantages and disadvantages.
|
-
One of my regular gripes about SQL Server concerns the number of Transact SQL features that give nondeterministic results. Unfortunately SQL Server 2005 added to the list of those. One of the new ones is ROW_NUMBER(), which is strictly due to ANSI/ISO rather than Microsoft. ROW_NUMBER() is inherently non-deterministic unless the PARTITION / ORDER BY criteria define a unique criteria for each row - something that isn't mandatory. What I think is surprising however, is Microsoft's approach to updateable views and CTE's, which gives an interesting twist to the effect of ROW_NUMBER(). For example, ROW_NUMBER() can be used to solve the problem of removing duplicates from a set and it's undeniably useful for that purpose. Here's an apparently simple example. The goal is to delete all but one row for each distinct value of Z. CREATE TABLE tbl (x INT NOT NULL PRIMARY KEY, z CHAR(1) NOT NULL); INSERT INTO tbl (x,z) VALUES (1,'A'); INSERT INTO tbl (x,z) VALUES (2,'A'); INSERT INTO tbl (x,z) VALUES (3,'B'); INSERT INTO tbl (x,z) VALUES (4,'B'); WITH t(r) AS /* Updateable CTE */ (SELECT ROW_NUMBER() OVER (PARTITION BY z ORDER BY z) FROM tbl) DELETE t WHERE r > 1; The problem is: which rows get deleted? The CTE doesn't contain a key, in fact it doesn't contain anything except the ROW_NUMBER() in column "r". The ORDER BY clause doesn't reference a key either which means the number returned for each row is nondeterministic. Yet despite these facts the CTE is still updateable. I was surprised when I first discovered that the DELETE statement above is legal. I was even more surprised to find that Books Online is totally silent on the point of exactly when a CTE is updateable and when it isn't. The lack of documentation is a serious omission because those who have been using SQL Server for a few years will know that undefined behaviour tends to get dropped or changed between releases. Right now, my habit is that if I need to update a CTE then I make sure it contains a key. Usually it makes sense for the OVER... ORDER BY clause to use a key also. Like: WITH t(x,r) AS /* Updateable CTE */ (SELECT x, ROW_NUMBER() OVER (PARTITION BY z ORDER BY x) FROM tbl) DELETE t WHERE r > 1; (note the difference: SELECT x, ... ORDER BY x).
|
-
-
Alice: What do you mean by practical languages? Riccardo: select from where. Alice: That's it? Vittorio: Well, there are of course lots of bells and whistles. Sergio: But basically, this forms the core of most practical languages. [1] Oracle is celebrating its 30th birthday this month. They didn't adopt Oracle as the company name until 1982, before which they were known as Relational Software Inc. Of course their very excellent, landmark database product is also "relational" by name and reputation but strictly non-relational in fact. All SQL DBMSs borrow certain features and terminology of the relational model while remaining fundamentally anti-relational in many important respects[2]. IBM began by declaring that its SQL language was in fact "relational" and the name has stuck.
Once or twice I have floated the idea that a new relational database language should be added to SQL Server as an alternative to the T-SQL language. In my opinion the time is right for it. SQL Server already has several data access languages and layers of abstraction - such as SSAS/MDX, LINQ, SSIS - so adding another one isn't exactly a radical idea. What I'm talking about would be a relational language that lives up to Date and Darwen's description of a "D" language. Naturally the new language would imply a new data model as well - the relational data model instead of the SQL one - but I don't think it would be difficult to build it on top of SQL Server's existing engine. In fact one of the key benefits would be that the engine could be freed of certain restrictions imposed by SQL (for example the requirement to support duplicates and column order). Implementing a new language would be a much more modest step than it was to add the OLAP Services engine into SQL Server 7.0 back in 1998. I was recently asked for some examples of problems solved by a true RDBMS that aren't solved already by SQL Server (or other SQL DBMSs for that matter). A full list would be pretty long! Among other things an RDBMS would deliver a truly unified model suitable for both OLTP and OLAP applications. Performance and storage improvements due to better data independence. Support for declarative constraint logic. A more powerful and generic type system. A relational language would also fix some of the more irritating limitations that waste so much SQL programming time today. Duplicate rows and the consequent failure to support guaranteed access. No relational assignment or comparison operators. No multiple assignment. Three-value logic. Is it going to happen? I don't know, but I do think the combination of today's business demands and today's technology make it a real possibility - perhaps more so than at any time in the last 30 years. SQL was a data access language for the 1980's but a 21st century replacement is well overdue. Comments and feeback are welcome! -- [1] From "Foundations of Databases", Chapter 7, by Abiteboul, Hull and Vianu. Addison-Wesley [2] There are numerous references that cover the details. I can recommend "Database in Depth" by Chris Date. "The Askew Wall" by Hugh Darwen, republished in "Relational Database Writings 1989-1991" explains the origins of SQL and some of its limitations.
|
|
|
|