
Using Oracle Berkeley DB as a NoSQL Data Store

Author: Shashank Tiwari
Source: http://www.oracle.com/

“NoSQL” is the new popular buzzword among developers, architects and even technology managers. However, despite the term's newfound popularity, surprisingly there is no universally agreed-upon definition for it.

Generally, any database that isn't an RDBMS, supports schema-less structures, relaxes ACID transactional guarantees, and promises high availability and support for large data sets in horizontally scaled environments is popularly categorized as a “NoSQL data store”. Given that these common features often stand in direct contrast to those of a good old RDBMS, some people propose non-relational, perhaps shortened to NonRel, as a more appropriate term than NoSQL.

Regardless, while the definitional conflict continues, many have begun to realize the benefits of NoSQL data stores by including them in their application stack. The rest are keeping a close watch and evaluating if NoSQL is right for them.

The growth of NoSQL as a category has also led to the emergence of a number of new data stores. Some of these new NoSQL products are good at persisting JSON-like documents, some are sorted ordered column-family stores, and others persist distributed key-value pairs. While the newer products are exciting and offer many nice features, a few existing ones deliver on the new promise as well.

One such data store is Oracle Berkeley DB. In this article I will illustrate and explain why and how Berkeley DB can be included in the stack as a NoSQL solution. The article focuses exclusively on Berkeley DB’s NoSQL-centric features and thus does not exhaustively cover all of Berkeley DB’s capabilities and idiosyncrasies.

Berkeley DB Essentials

Fundamentally a key-value store, Berkeley DB comes in three distinct flavors:

  1. Berkeley DB, the original, written in C
  2. Berkeley DB Java Edition (JE), a pure Java implementation
  3. Berkeley DB XML, an XML database layered on top of Berkeley DB

(Note: Although this article does not expressly cover Berkeley DB JE or Berkeley DB XML, it does include examples that use the Java API and the Java-based persistence frameworks to illustrate the capabilities of Berkeley DB.)

Simple as it may be at its core, Berkeley DB can be configured to provide concurrent non-blocking access, to support transactions, to scale out as a highly available cluster of master-slave replicas, or in a number of other ways.

Berkeley DB is a pure storage engine that makes no assumptions about an implied schema or structure in the key-value pairs. Berkeley DB therefore easily allows higher-level API, query, and modeling abstractions on top of the underlying key-value store. This facilitates fast and efficient storage of application-specific data, without the overhead of translating it into an abstracted data format. The flexibility offered by this simple yet elegant design makes it possible to store both structured and semi-structured data in Berkeley DB.

Berkeley DB can run as an in-memory store to hold small amounts of data, or it can be configured as a large data store with a fast in-memory cache. Multiple databases can be set up in a single physical install with the help of a higher-level abstraction called an environment. One environment can have multiple databases. You need to open an environment and then a database to write data to it or read data from it. It's advised that you close the database and the environment once you have completed your interactions, to use resources optimally.
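Here is a minimal sketch of that open-and-close lifecycle, using the Berkeley DB JE API (com.sleepycat.je); the environment home directory and database name are illustrative assumptions:

import java.io.File;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

public class EnvironmentLifecycle {
    public static void main(String[] args) throws Exception {
        // Open (and, if necessary, create) the environment in its home directory.
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(new File("/tmp/bdb-env"), envConfig);

        // Open a database inside the environment.
        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        Database db = env.openDatabase(null, "sampleDB", dbConfig);

        // ... put and get key-value pairs here ...

        // Close the database before the environment to release resources cleanly.
        db.close();
        env.close();
    }
}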

Each item in a database is a key-value pair. The key is typically unique, although duplicates are allowed. A value is accessed using its key. A retrieved value can be updated and saved back to the database. Multiple values are accessed and iterated over using a cursor. Cursors allow you to loop through the collection of values and manipulate them one at a time. Transactions and concurrent access are also supported.
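As an example, a cursor loop in the Java API looks roughly like the following sketch, which assumes an open Database named db plus the DatabaseEntry, Cursor, LockMode, and OperationStatus classes from the same package:

DatabaseEntry key = new DatabaseEntry();
DatabaseEntry value = new DatabaseEntry();
Cursor cursor = db.openCursor(null, null);
try {
    // getNext() positions the cursor on each successive pair until none remain.
    while (cursor.getNext(key, value, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
        // key.getData() and value.getData() expose the raw bytes of the pair.
        System.out.println(key.getData().length + " key bytes, "
                + value.getData().length + " value bytes");
    }
} finally {
    cursor.close(); // always release the cursor before closing the database
}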

The key of a key-value pair almost always serves as the primary key, which is indexed. Other properties within the value could serve as secondary indexes. Secondary indexes are maintained separately in a secondary database. The main database, which has the key-value pairs, is therefore also sometimes referred to as the primary database.
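A secondary database is populated through a key-creator callback. The following sketch uses the JE API (com.sleepycat.je) and assumes an open Environment env and primary Database db; the database name and the extractProperty() helper are hypothetical:

SecondaryConfig secConfig = new SecondaryConfig();
secConfig.setAllowCreate(true);
secConfig.setSortedDuplicates(true); // several primary records may share a secondary key
secConfig.setKeyCreator(new SecondaryKeyCreator() {
    public boolean createSecondaryKey(SecondaryDatabase secondary,
            DatabaseEntry key, DatabaseEntry data, DatabaseEntry result) {
        // Derive the secondary key from the stored value's bytes.
        // Returning false tells Berkeley DB to leave this record unindexed.
        result.setData(extractProperty(data.getData())); // hypothetical helper
        return true;
    }
});
SecondaryDatabase secDb =
    env.openSecondaryDatabase(null, "valuePropertyIndex", db, secConfig);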

Berkeley DB runs as an in-process data store, so you statically or dynamically link to it when accessing it using the C, C++, C#, Java or scripting language APIs from a corresponding program.

With that whirlwind introduction, the following section describes Berkeley DB in the context of its NoSQL-centric features.

Fluid Schema

The first benefit of a NoSQL store is its relaxed attitude toward well-defined database schemas. Let us see how Berkeley DB fares on this feature.

To appreciate Berkeley DB's capabilities, I suggest you take it for a spin: download and install both Berkeley DB and Berkeley DB JE on your machine so you can try some examples yourself and follow along with the rest of the illustrations in this article. Download links and installation instructions are available online. (I compiled Berkeley DB with --enable-java, --enable-sql, and --prefix=/usr/local for this article.) The fundamental concepts that relate to storage, access mechanisms, and the API don't vary much between Berkeley DB and Berkeley DB JE, so most of what I cover next applies equally well to both.

Berkeley DB itself imposes few restrictions on data items beyond their being a collection of key-value pairs. This allows applications to flexibly use Berkeley DB to manage data in a variety of formats, including SQL, XML, and Java objects. You can access data in Berkeley DB via the base API, the SQL API, the Java Collections API, and the Java Direct Persistence Layer (DPL). It allows a few different storage configurations: B-Tree, Hash, Queue, and Recno. (The Berkeley DB documentation refers to the different storage mechanisms as “access methods”. The Hash, Queue, and Recno access methods are available only in Berkeley DB and not in Berkeley DB JE or Berkeley DB XML.)

Depending on your specific use case, you can choose an access mechanism and a storage configuration. This choice of a particular access method and storage configuration can affect your schema. To understand the impact of these choices, you first need to understand what they are; I cover the access methods and storage configurations next.
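For Berkeley DB proper, the access method is chosen when a database is created. Here is a sketch using the Berkeley DB Java API (com.sleepycat.db); the file name is an assumption, and note that Berkeley DB JE always uses a B-Tree, so setType() does not apply there:

DatabaseConfig dbConfig = new DatabaseConfig();
dbConfig.setAllowCreate(true);
dbConfig.setType(DatabaseType.HASH); // or DatabaseType.BTREE, QUEUE, RECNO
Database db = new Database("store.db", null, dbConfig);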

Using the Base API

The Base API is a low-level API that allows you to store, retrieve, and update the data, that is, the key-value pairs. This API is similar across the different language bindings; the Base API for C, C++, and Java are therefore much the same. DPL and the Java Collections API, on the other hand, are offered only as abstractions in the Java API.

The Base API puts, gets, and deletes key-value pairs. Both the key and the value are arrays of bytes; all key and data values are serialized to byte arrays before they are stored. One could use Java's built-in serializer or Berkeley DB's BIND API to serialize various data types to byte arrays. Java's built-in serializer is typically a slow performer, so the BIND API is usually the better choice. (The jvm-serializers project benchmarks various alternative serializers and is a good reference point for analyzing the relative performance of different serialization mechanisms on the JVM.) The BIND API avoids redundantly storing the class information with every serialized instance and instead puts that information in a separate database. You could potentially make things even faster by writing your own custom tuple binding to improve on the BIND API's default performance.

As an elementary example, you can have a data value defined as follows:

import java.io.Serializable;
public class DataValue implements Serializable {
    private long prop1;
    private double prop2;

    DataValue() { 
      prop1 = 0;
      prop2 = 0.0;
    }

    public void setProp1(long data) {
      prop1 = data;
    }
    
    public long getProp1() {
      return prop1;
    }
    
    public void setProp2(double data) {
      prop2 = data;
    }
    
    public double getProp2() {
      return prop2;
    }
}


Now, you can store this data value using two databases, one to store the value with a key and another to store the class information.

The data is stored using four distinct steps:

  1. First, a second database, apart from the one storing the key-value pairs, is opened to hold the class data:

    Database aClassDB = new Database("classDB", null, aDbConfig);
  2. Then a class catalog is instantiated as follows:
    
    StoredClassCatalog storedClassCatalog = new StoredClassCatalog(aClassDB);
  3. A serial entry binding is established like so:
    
    EntryBinding binding = new SerialBinding(storedClassCatalog, DataValue.class);
  4. Finally, a DataValue instance is created like so:
    
    DataValue val = new DataValue();
    val.setProp1(123456789L);
    val.setProp2(1234.56789); 


    This instance is then mapped to a Berkeley DB DatabaseEntry, which serves as a wrapper for both keys and values, using the binding you just created:

    DatabaseEntry deKey = new DatabaseEntry(aKey.getBytes("UTF-8")); // aKey is an application-supplied String
    DatabaseEntry deVal = new DatabaseEntry();
    binding.objectToEntry(val, deVal);


Now you are ready to put the key-value pair in Berkeley DB.
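The actual put and a matching get then look like the following sketch, assuming the primary key-value database is open as aDb (a hypothetical variable name; LockMode and OperationStatus come from the same package as Database):

aDb.put(null, deKey, deVal);

DatabaseEntry found = new DatabaseEntry();
if (aDb.get(null, deKey, found, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
    // The binding deserializes the stored bytes back into a DataValue.
    DataValue stored = (DataValue) binding.entryToObject(found);
    System.out.println(stored.getProp1() + ", " + stored.getProp2());
}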

The base API supports a few variants of the put and get methods to allow or disallow duplicates and overwrites. (The intent of neither this example nor this article is to teach you the detailed syntax or semantics of the base API, so I will not go into further details; see the documentation online.) An important takeaway is that the base API allows for low-level manipulation and custom serialization when storing, retrieving, and deleting key-value pairs.

If you prefer to interact with Berkeley DB through a higher-level API, you should use DPL.

Using DPL

The Direct Persistence Layer (DPL) provides familiar Java persistence framework semantics for manipulating objects. You can treat Berkeley DB as an entity store where objects are persisted, retrieved, updated, and deleted. DPL uses annotations to mark a class as an @Entity. Associated classes, which get stored with an entity, are annotated as @Persistent. Specific properties or variables can be annotated as @PrimaryKey and @SecondaryKey. A simple entity could look as follows:

import com.sleepycat.persist.model.Entity;
import com.sleepycat.persist.model.PrimaryKey;
import com.sleepycat.persist.model.SecondaryKey;
import static com.sleepycat.persist.model.Relationship.ONE_TO_ONE;

@Entity
public class AnEntity {

    @PrimaryKey
    private int myPrimaryKey;

    @SecondaryKey(relate=ONE_TO_ONE)
    private String mySecondaryKey;
    ...
} 


DPL imposes the class definition as a well-defined schema. From the base API we know that such conformance to a schema is not a requirement for Berkeley DB. For some use cases, however, formal entity definitions are helpful and provide a structured approach to data modeling.
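Reading and writing entities goes through an EntityStore and its indexes. A minimal sketch (com.sleepycat.persist), assuming an open Environment env; the store name and the anEntity instance are assumptions:

StoreConfig storeConfig = new StoreConfig();
storeConfig.setAllowCreate(true);
EntityStore store = new EntityStore(env, "entityStore", storeConfig);

// The primary index is typed by the primary key class and the entity class.
PrimaryIndex<Integer, AnEntity> byPrimary =
    store.getPrimaryIndex(Integer.class, AnEntity.class);
byPrimary.put(anEntity);            // insert or update by primary key
AnEntity found = byPrimary.get(42); // point lookup by primary key

// Secondary indexes are looked up by the annotated field's name.
SecondaryIndex<String, Integer, AnEntity> bySecondary =
    store.getSecondaryIndex(byPrimary, String.class, "mySecondaryKey");

store.close();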

Storage Configuration

As mentioned previously, key-value pairs can be stored in four different types of data structures: B-Tree, Hash, Queue and Recno. Let’s see how they stack up.

The different configurations allow you to store arbitrary types of data in a collection. As with other NoSQL stores, there is no fixed schema other than the one imposed by your model. In the extreme case, you are welcome to store disparate value types for two keys in the same collection. Value types can be complex classes, which for the sake of argument could represent a JSON document, a complex data structure, or a structured data set. The only real restriction is that the value must be serializable to a byte array. A single key or a single value can be as large as 4GB.

Secondary indexes make it possible to filter on the basis of value properties. The primary database does not store data in a tabular format, so non-existent properties are simply not stored for sparse data sets. A secondary index skips all key-value pairs that lack the property on which the index is created. In general, the storage is compact and efficient.

Support for Transactions

Berkeley DB is a very flexible database that lets you turn many features on and off: it can run without transaction support, or it can be configured for full ACID transactional integrity. This malleability makes Berkeley DB an appropriate data store for many situations. Transactional integrity is typically the least supported feature in a quintessential NoSQL data store, so in highly available systems that don't require ACID compliance, Berkeley DB can turn transactions off and behave like a typical NoSQL product; where integrity matters, it can support full transactions instead.

While I don't intend to cover transactions in detail, it's worth noting that, like a traditional RDBMS, Berkeley DB with transactions enabled allows you to define transaction boundaries. Once committed, data is persisted to disk. To enhance performance, you can use non-durable commits, where writes are committed to in-memory log files and later synced with the underlying file system. Isolation levels and locking mechanisms are also supported.
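A transactional write in the JE API (com.sleepycat.je) looks roughly like this sketch, assuming env and db were both opened with setTransactional(true):

Transaction txn = null;
try {
    txn = env.beginTransaction(null, null);
    db.put(txn, deKey, deVal); // the write joins the transaction
    txn.commit();              // durable commit; commitNoSync() trades durability for speed
    txn = null;
} finally {
    if (txn != null) {
        txn.abort();           // roll back everything done under this transaction
    }
}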

Before a database is closed, a sync operation ensures that the persistent file copies reflect up-to-date in-memory information. The combination of this sync operation and Berkeley DB's transactional recovery subsystem (assuming that you have enabled transactions) ensures that the database is always returned to a consistent transactional state, even in the event of application or system failure.

Large Data Sets

Theoretically, a Berkeley DB database has an upper bound of 256TB, but in practice it's usually bound by the capacity of the machine it runs on. At the time of this writing, Berkeley DB does not support extremely large files that span multiple machines via distributed file systems. (Files larger than a single node can be managed with the help of distributed file systems like the Hadoop Distributed File System, or HDFS.) Berkeley DB works better on local file systems than on network file systems. More accurately, Berkeley DB relies on the POSIX-compliant attributes of a file system: for example, when Berkeley DB calls fsync() and the file system returns, Berkeley DB assumes the data has been written to persistent media. For performance reasons, distributed file systems typically do not guarantee that a write has completed all the way to the persistent media.

The maximum B-Tree depth supported is 255. Lengths of the key and value records are typically bound by the available memory.

Horizontal Scale-out

Berkeley DB replication follows a master-slave pattern: there is one master and multiple slaves, or replicas. However, the selection of a master is not static, and it's recommended that the selection not be manual. All participants in a replication cluster go through an election process to choose the master; the one with the most up-to-date log records wins, and if there is a tie, priorities are used to break it. The election process is based on an industry-standard Paxos-compliant algorithm.
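Configuring a node to join a replication group looks roughly like this sketch, using the JE High Availability API (com.sleepycat.je.rep); the group, node, host, and directory names are all assumptions:

EnvironmentConfig envConfig = new EnvironmentConfig();
envConfig.setAllowCreate(true);
envConfig.setTransactional(true); // replication requires a transactional environment

ReplicationConfig repConfig = new ReplicationConfig();
repConfig.setGroupName("articleGroup");
repConfig.setNodeName("node1");
repConfig.setNodeHostPort("replica1.example.com:5001");
repConfig.setHelperHosts("master.example.com:5001"); // a known member to contact on startup

// Opening the replicated environment triggers the election described above.
ReplicatedEnvironment repEnv =
    new ReplicatedEnvironment(new File("/tmp/rep-env"), repConfig, envConfig);
System.out.println("This node is: " + repEnv.getState()); // MASTER or REPLICA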

Replication has numerous benefits, including higher read throughput (reads can be served by any replica), improved availability, and resilience to individual node failures.

Summary

Berkeley DB undoubtedly qualifies as a robust and scalable NoSQL key-value store; the use of Berkeley DB as the underlying storage for Amazon's Dynamo, Project Voldemort, MemcacheDB, and GenieDB is further evidence supporting this claim. There has been a bit of FUD around Berkeley DB performance, especially in the wake of a couple of comparative benchmarks published online.

However, there are many live systems that prove Berkeley DB’s strengths. Many of these systems, through careful tuning and application coding improvements, have achieved excellent scalability, throughput, and reliability results. Following the lead of those systems, Berkeley DB can certainly be used as a scalable NoSQL solution.


Shashank Tiwari is Founder & CEO of Treasury of Ideas, a technology-driven innovation and value optimization company. As an experienced software developer and architect, he is adept in a multitude of technologies. He is an internationally recognized speaker, author, and mentor. As an expert group member on a number of JCP (Java Community Process) specifications, he has been actively participating in shaping the future of Java. He is also a common voice in the NoSQL and cloud computing space and a recognized expert in the RIA community.