Moving from MySQL to Couchbase Server – Part 2

In the first part, we looked at the basic mechanics of MySQL and Couchbase Server 2.0, and compared the ways in which you would model the data, and how the basic queries and lists of records operate, from the MySQL SQL statement, to the Couchbase Server View.

Author: MC Brown, Couchbase, http://www.couchbase.com/

This article was originally published on http://blog.couchbase.com/how-move-mysql-couchbase-server-20-part-2 and is reproduced here with Couchbase permission.

This time, we’ll start by looking at how to write those Views and specific elements of the SQL queries, like WHERE and GROUP BY clauses, and the process of actually migrating your data and application logic to Couchbase Server.

Think about the queries you want to execute

As we’ve already seen, querying a Couchbase Server database is really a two-stage process. The first stage involves creating a view on your data that supports searching or selecting on the documents in your database. The second half is the URL and key values that you want to select out of the view. This means that when thinking about how you want to get the information out of your Couchbase database you need to think about how you want to search the data, and that will allow you to define the structure of the view that needs to be written.

Couchbase Server uses views to create lists of documents from the database, and the output from a view is a key and a corresponding value. Both keys and values can be any JSON value. The key in the output is significant because it forms the basis of the searching, sorting and pagination mechanism.

For example, a view function that just outputs all of the recipe documents in the database, using the recipe title as the key, might look like this:
function(doc) { if (doc.type === 'recipe' && doc.title !== null) { emit(doc.title, null); } }

This is called the map function – it maps the information from the documents into a format that you want to use. An optional second step is called the reduce function, and is analogous to the MySQL aggregation functions and GROUP BY clause.

When migrating from MySQL, construction of your views to match the information that you want to output or query can also affect how you store the information. Let’s return to the recipe example. In MySQL, you would use a query on the ingredients table to match against carrot and get a list of matching recipes. For example, as an SQL query you would express this as :
SELECT recipe.id, recipe.title FROM ingredients join (recipe) on (ingredients.recipeid = recipe.id) where ingredient = 'carrot'

In Couchbase, one way to achieve the same result is to write a view that outputs one row for each ingredient from our recipe document, like this:
function(doc) { if (doc.type == 'recipe' && doc.title !== null) { for id in doc.ingredients { emit(doc.ingredients[id].ingredient, null); } } }

To query, you must specify the key from the output that you want to match. Querying is performed over the REST API, or through your client library, but the specification is the same. For example, if you want to search for “carrots”, you must specify this as the key you are looking for. Using the REST API:
http://localhost:8092/recipes/_design/dev_ingred/_view/byingred?key=%22carrots%22&connection_timeout=60000&limit=10&skip=0

Remember that the key is any JSON value, so the key values you supply to the query must be correctly encoded JSON too. As you can see from this example, the view provides the basis of your query, and the database access the selection mechanism.

The use of views in this way means that most applications will end up being constructed with a number of simultaneous views. The process of building your views is really no different to constructing your core queries within your application, and then optimizing the query and indexes that produce the output. The main difference is that the view definition is stored within the database, not as a query within your application code.

Aggregations

In MySQL aggregation is used in a number of different places. Aggregation is handled by a combination of the GROUP BY clause and functions that collect or summarize the information. Using our recipe example again, we might run a query that provides a count of the different recipes by the ingredients they contain. For example, a query like this:
SELECT ingredient,count(recipeid) FROM ingredients GROUP BY ingredient

is going to provide us with a list of ingredients, with a count next to each one to show how many recipes have that ingredient. In a final application this goes to make up the ‘hottest recipe ingredients’, or in other apps ‘most popular blog post’ and so on.

For example, to get the equivalent of the above we can use the same map function as we’ve used above. Then we can use one of the built-in reduce functions, _count, which will produce a count based on the number of unique keys output by the emit() function. If a reduce function exists, it is used automatically when the view is accessed. Accessing the view produces a list of the number of recipes for individual ingredients:
{"rows":[ {"key":"banana","value":20}, {"key":"bananas","value":33}, {"key":"baps","value":1}, {"key":"barbecue sauce","value":1}, {"key":"basmati rice","value":16}, {"key":"bay leaf","value":58}, {"key":"bay leaves","value":35}, {"key":"bean sprouts","value":18}, {"key":"bean thread noodles","value":1}, {"key":"beef braising steak","value":17} ] }

Aggregations and reduce functions can be used for other operations too. For example, in part 1 I showed a recipe having timing information for the preparation and cooking. We could use a map and reduce function together to get the total cooking time for each recipe. First, write a map function that outputs the recipe title as the key, and the cooking time information as a value:
function (doc, meta) { if (doc.title) { emit(doc.title, parseInt(doc.preptime)); emit(doc.title, parseInt(doc.cooktime)); } }

We ensure that the value output is a number (using the JavaScript parseInt() function). Using the built-in _sum function will result in the view calculating the total of all the values against a given key (our recipe title):
{"rows":[ {"key":"Apple pie","value":55}, {"key":"Apricot cheesecake","value":70}, {"key":"Barbecued aubergine","value":10}, {"key":"Barbecued beefburgers","value":30}, {"key":"Barbecued corn on the cob slices","value":13}, {"key":"Beef in red wine","value":144}, {"key":"Blue cheese and tomato dip","value":5}, {"key":"Cajun chicken","value":20}, {"key":"Caribbean cobbler stew","value":95}, {"key":"Chappati or roti","value":45} ]

In SQL you might just store away the total cooking time in the table to make it easy to query on this information. Within Couchbase, we can make use of the ability to process and combine information when building the view.

Range Queries

More complex queries, where you are selecting from multiple fields is a case of building a suitable view, and then using the key query argument to specify what you are looking for. Where you want to pull out information for a range of values, you can use the start-key and end-key to specify the range to be returned by the view. For example, constructing a view with the cooking time as a key would let you output a list of all the matching recipes where the cooking time was between 5 and 25 minutes:
http://localhost:8092/recipes/_design/recipes/_view/by_cooktime?startkey=%225%22&endkey=%2225%22

This analogous to:
SELECT title FROM recipe WHERE cooktime >= 5 AND cooktime <= 25

Querying on Multiple Fields

To query on multiple fields, you need to construct a view that outputs the two (or more) fields that you want to query on. When looking for recipes, for example, it’s fairly common to want to look for recipes with a certain ingredient, but also within a certain time limit. When you open your fridge 30 minutes before going out and finding tomatoes, you want to know what you can cook in 20 minutes with tomatoes. To do that, we output a view that emits a key with the ingredient, and the time required to cook the recipe, like this:
function (doc, meta) { if (doc.ingredients) { for (i=0; i < doc.ingredients.length; i++) { if (doc.ingredients[i].ingredient != null) { emit([doc.ingredients[i].ingredient, parseInt(doc.totaltime)], null); } } } }

Now to get all of the recipes with tomatoes cookable in 20 minutes we can search for:
[“tomatoes”,20]
Or more explicitly:
http://localhost:8092/recipes/_design/dev_bytime/_view/byingredtime?full_set=true&key=%5B%22tomatoes%22%2C20%5D&stale=false&connection_timeout=60000&limit=10&skip=0

Of course, if you have 20 minutes to cook a recipe, you have less than 20 minutes. So in fact we can use a range query with a startkey of [“tomatoes”,0] and an endkey of [“tomatoes”,20] and now we’ll get all recipes cookable within 20 minutes containing carrots:
{"total_rows":27446,"rows":[ {"id":"4997AFDE-3027-11E2-BB0D-B67A7E241592","key":["tomatoes",5],"value":null}, {"id":"79C16D8A-3027-11E2-BB0D-B67A7E241592","key":["tomatoes",5],"value":null}, {"id":"CF52D23E-3027-11E2-BB0D-B67A7E241592","key":["tomatoes",5],"value":null}, {"id":"42F5D520-3027-11E2-BB0D-B67A7E241592","key":["tomatoes",6],"value":null}, {"id":"1A0AB61C-3027-11E2-BB0D-B67A7E241592","key":["tomatoes",8],"value":null}, {"id":"4D0F9DF2-3027-11E2-BB0D-B67A7E241592","key":["tomatoes",8],"value":null}, {"id":"50CE1676-3027-11E2-BB0D-B67A7E241592","key":["tomatoes",10],"value":null}, {"id":"564A66CC-3027-11E2-BB0D-B67A7E241592","key":["tomatoes",10],"value":null}, {"id":"95E9464A-3027-11E2-BB0D-B67A7E241592","key":["tomatoes",10],"value":null}, {"id":"9A78477E-3027-11E2-BB0D-B67A7E241592","key":["tomatoes",10],"value":null} ] }

Cool! Compared to the SQL statement:
SELECT recipeid from recipes JOIN ingredients ON (ingredients.recipeid = recipe.recipeid) WHERE ingredients.ingredient = 'tomatoes' and totaltime <= 20

Incidentally, note that we get the ID of each document so that we can load the document, and therefore the recipe. We might not have the recipe title in our output, but the client libraries can automatically load the document information for us. The hard part is performing the search.

Sorting View Output

Sorting of the output of a view is automatic. The output of every view is automatically sorted by the key generated by each view. This is needed because without it, the range query would be impossible to achieve. This makes some operations quick and easy. If you want to output a list of recipes in alphabetical order by their title, you would create a view that outputs the recipe title as the key. This means that the map function:
function (doc, meta) { emit(doc.title, null); }

Accessed through the View as:
http://localhost:8092/recipes/_design/recipes/_view/by_title

Is analogous to:
SELECT title FROM recipe ORDER BY title

To get a list in descending order in MySQL you use the DESC keyword:
SELECT title FROM recipe ORDER BY title DESC

In Couchbase Server you use the descending parameter to the URL:
http://localhost:8092/recipes/_design/recipes/_view/by_title?descending=true

Pagination

In MySQL, paging is handled by using a combination of the LIMIT and OFFSET clauses, for example to get the first 10 records from our recipe title query:
SELECT title FROM recipe ORDER BY title LIMIT 10

To get the next 10 records:
SELECT title FROM recipe ORDER BY title LIMIT 10 OFFSET 10

With Couchbase Server you use a similar setup, by supplying the limit and skip arguments to the view query. To get the first ten records:
http://localhost:8092/recipes/_design/recipes/_view/by_title?limit=10

And the next 10 records:
http://localhost:8092/recipes/_design/recipes/_view/by_title?limit=10&skip=10

When dealing with relatively short blocks of pages this works well. For longer pages, or where you are potentially skipping over a large number of rows of output, you can use the startkey_docid to skip forward within the view. This is more efficient for these longer pagination scenarios. Check the documentation for more information.

Loading the core object

In our example recipe database in MySQL, there are multiple tables required to load the database information about a single recipe so that it can be displayed. Within Couchbase Server, once you have the document ID for a given recipe, you have access to entire recipe record structure as it has been stored within the Couchbase Server database.

With MySQL the need to simplify the loading mechanism for such complex objects has lead to a number of different supporting projects that can be as a way to cache the loaded complex data so that it is more accessible (memcached, for example). Couchbase Server doesn’t eliminate the need to load the information from the database, but it does make loading the recipe record a single operation, rather than multiple queries and formatting of that data into the structure used for display.

In SQL for example you might load the recipe data by looking at the individual tables and constructing your object, running a number of different queries:
SELECT * FROM recipe WHERE recipeid=23434 SELECT * FROM ingredients WHERE recipeid=23434 SELECT * FROM methods WHERE recipeid=23434 SELECT * FROM keywords WHERE recipeid=23434

Once you know the document ID, getting the content of the entire document is a single query to Couchbase Server. For example, if you get the document ID Aromaticroastchicken from our ‘by ingredient’ view, I can get the entire recipe for display with a single GET request using the client library. For example, in PHP:
$recipe = couchbase->get(‘Aromaticroastchicken’);

The result is the recipe data as JSON:
{ "title" : "Aromatic roast chicken", "preptime" : "15", "servings" : "6", "totaltime" : "120", "subtitle" : "A couscous stuffing, and aromatic spices with fragrant lime and roasted garlic, turn a simple roast chicken into something special.", "cooktime" : "105", "keywords" : [ "diet@dairy-free", "cook method.hob, oven, grill@oven", "diet@peanut-free", "special collections@cheffy recommended", "diet@corn-free", "special collections@weekend meal", "occasion@entertaining", "special collections@very easy", "diet@yeast-free", "diet@shellfish-free", "special collections@feeling spicy!", "main ingredient@poultry", "diet@demi-veg", "meal type@main", "diet@egg-free", "diet@cow dairy-free" ], "ingredients" : [ ... ], "method" : [ ... ], }

One query, one request to the database to load everything required to display the recipe on a single page. Other content could be included here of course, user comments, additional metadata, all stored within the single document and just a single request away.
Migrating the data

The formatting of the data brings up another salient point when thinking about that migration when I started the process. Think about your data and how it will be used. If you display information that, in MySQL, comes from multiple sources, then it makes sense to group this information together and store it in Couchbase Sever as a single document.

As already mentioned, many acceleration mechanisms used with MySQL work on the basis of caching the information once it has been loaded from multiple sources to speed up loading of such complex structures and information.

The same basic principle is true with Couchbase Server, bearing in mind the caveats I mentioned above about the structure. You need to be sure the structure you create allows you to construct a view that makes the content searchable, if that data is how you want to pull the documents out of the database.

For example, in our recipe structure, there is no point in storing the ingredients as a single text field and losing the detail of the content. This is akin to putting everything into a single text column within MySQL and has the same pitfalls; it makes it difficult to search and query the contents, and complicates both the view and display possibilities.

The actual migration process should be quite straightforward. Load the information from MySQL, construct a document for each major object in your database, and then write the docs into a Couchbase Server. In fact, you can even use this method to migrate the information on a live running application. Instead, try to laod the item from Couchbase, and if it doesn’t exist, load it from MySQL, then store the constructured object into Couchbase.

Documents in Couchbase are stored by their ID. The ID can be any string you like, which means that you can use something identifiable to you or your application without always having to rely on a view to find it. To use our ever-present recipe database, you could use the recipe title, for example ‘Fish Stew’. The limitation of this approach (just as with MySQL and unique indexes) is that you can’t store two documents with the same ID, and there are lots of different Fish Stew recipes out there. As an alternative you could use a UUID or use the recipe title with a suffix to allow multiple ‘Fish Stew’ recipes.

Just as with the MySQL AUTO_INCREMENT field option, Couchbase will automatically generate a UUID for each document. This isn’t enforced; you can store both documents a specific document ID and auto-generated IDs in the same database, which means you can use known IDs to store specific information, and auto-generated IDs for the main body of data.

Wrapping Up

The main difference between Couchbase and MySQL is the representation of the data and how you access it. For certain data models, Couchbase provides a simpler solution to the problem of loading the single-object style data that we’ve seen demonstrated in these pieces. Recipes are a good example where the key element of the data store is the recipe as a whole (our document). The ability to search and find recipe data and present it as a list is a requirement in terms of the way we need to present and access the data, not a requirement related to how it is stored.

Hopefully these two posts have given you some ideas about how your data can be moved from MySQL to Couchase Server, and how to model and develop your Couchbase Server data and application code to make the system work and operate like your existing MySQL-based infrastructure.