Welcome to EMC Consulting Blogs Sign in | Join | Help

James Saull's Blog

The ethical slacker

Partition, Federate, Aggregate and Synchronise

It is common practice to use a partitioning function to scale out a solution. For example, you may divide a database among four equal geographical regions and have four separate installations of the database each with the same database but only a quarter of the data. When these four databases outgrew the single physical infrastructure they could be split out on to  separate physical machines with a region on each. When throughput of each machine was in jeopardy you could change the partitioning scheme to be more granular - e.g. sub regions. With 40 sub regions you could house 4 sub regions per physical server and have 10 physical servers. And so on and so on. Clearly the application tier needs to be aware of the partitioning scheme and be able to resolve the request it is dealing with to the server that is able to process it. DNS is an example of this resolution by resolving the partition to the handling resource. E.g. www.conchango.com resolves to an IP address which directs your resource request to the appropriate resources.

Remember the partitioning scheme should take in to account how the data is used and not just logical groupings. Ideally your architecture will include the ability to easily re-partition.You may be motivated by reducing the sheer data volumes - solved by creating many smaller databases devoted to a much more manageable data partition. You may be motivated more by sheer TPS (Transactions Per Second) with a perfectly reasonable data volume - solved perhaps by load balancing across identical servers. Ease of re-partitioning allows you to identify hot areas of data attracting activity and break that into many partitions so that you can give each server equal amounts of hot and cold partitions. These hot/cold areas may not be fixed for all time...

Anyway my point is not about comparing all the possible schemes! Get to the point!

Either way you slice it you introduce two key problems...

Aggregate. You are likely to have queries that span partitions. Typically this would be a reporting requirement but sometimes it can be part of the OLTP solution. Spanning partitions might mean different tables, different databases, different servers, different continents... ETL/Data Warehousing techniques have solved this problem.

Synchronising. You might struggle to partition the data because queries need to span the entire dataset; therefore you may create four identical copies of the database and put a load balancer across the top (another form of partition and resolve). But what if the data is Read/Write? Then you must use a replication scheme for the updates so write-reads are consistent even when each part is served by different servers. This will give the problems of replication latency (i.e. you make an update on server 1, but your subsequent read happens against server 2 too soon and the update you made appears to have been lost because it has not arrived on server 2 yet). To avoid replication latency you may use a distributed transaction but this is not ideal as it will defeat scalability. Synchronising can also lead to merge conflicts... There are a variety of solutions but they are dependent on the nature of the problem.

Just thought I'd mention it, because although a lot of people partition and synchronise they forget about the aggregation issue.

Published 04 June 2008 23:01 by James.Saull
Filed under:

Comments

No Comments
Anonymous comments are disabled
Powered by Community Server (Personal Edition), by Telligent Systems