Same day, different moments

[Photos: 20170930-DSC07809, 20170930-DSC07836, 20170930-DSC07853]

Posted in Photography | Tagged | Leave a comment

Old songs and memories of faded colors

Posted in Photography | 1 Comment

Another Spring

[Photos: 20160221-DSC03562, 20160221-DSC03551]

Posted in Photography | Tagged | Leave a comment

Why did we build our own blob store?

In a world increasingly dominated by open source technology, we understand that we shouldn’t reinvent the wheel unless there are good reasons to do so. The Bob (read: Bag of Blobs) team at Upthere is a small group (four members as of this writing), yet we decided to take on a very challenging project: building our own blob store. At first glance this seems unnecessary, given that there are already some pretty good open source storage solutions such as HDFS and Ceph. Upthere is no longer an early-stage startup, but it still doesn’t have abundant resources to build everything in-house. So why did we devote the effort to build our own blob store from scratch?

There are obvious benefits, such as tighter integration with our in-house health and monitoring systems. But these alone don’t justify the significant effort required. In this blog post, let me explain in detail why HDFS and Ceph aren’t the best fit for us.

Let’s start with HDFS. The Hadoop Distributed File System is a good companion to Hadoop and has demonstrated many strengths in practice. However, it has several performance issues that we don’t like. For example:

  • The single NameNode in HDFS is a bottleneck. Since all metadata is stored in the NameNode’s memory and that memory cannot be increased arbitrarily, the system will eventually run out of metadata space (which happens sooner with many small files). Even if huge memory and file-wrapping tricks buy some headroom, certain operations (e.g., scanning the entire metadata) are still bottlenecked by the single node. HDFS federation is a possible workaround, but it splits files across separate namespaces (which can sometimes be a good thing). We want storage that scales out more gracefully.
  • HDFS relies on a local file system to write data blocks as files (which is fine), but it doesn’t schedule the I/O itself. Even though an HDFS block can be many megabytes (e.g., 64-256 MB), when the local file system writes it to disk it may not do so in a single pass, so the block can end up fragmented. This is particularly true when multiple writers target the same disk at the same time, because the local file system scheduler maintains fairness among clients at a finer granularity. In practice, the number of disk seeks grows with the number of concurrent writers, and we all know that disk seeks hurt I/O performance.
  • When blocks are fragmented on disk, reads also need more seeks to complete. The OS page cache doesn’t help much because HDFS blocks are large and are usually streamed just once per application; in that case it is actually preferable to bypass the OS cache entirely and transfer data directly from disk to user space (a minimal sketch of such a direct read follows this list).
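
To make that last point concrete, here is a minimal, Linux-only sketch of a direct read with O_DIRECT so the transfer bypasses the page cache. The path, chunk size, and 4 KiB alignment are illustrative assumptions, not details of our actual system.

package main

import (
    "fmt"
    "os"
    "syscall"
    "unsafe"
)

// O_DIRECT requires the user buffer (and read size) to be aligned to the
// device's logical block size; 4 KiB is assumed here for illustration.
const alignment = 4096

// alignedBuf returns a slice of length size whose first byte is aligned.
func alignedBuf(size int) []byte {
    raw := make([]byte, size+alignment)
    off := alignment - int(uintptr(unsafe.Pointer(&raw[0]))%alignment)
    return raw[off : off+size]
}

func main() {
    // Hypothetical blob path, for illustration only.
    fd, err := syscall.Open("/data/blob-000001", syscall.O_RDONLY|syscall.O_DIRECT, 0)
    if err != nil {
        fmt.Fprintln(os.Stderr, "open:", err)
        os.Exit(1)
    }
    defer syscall.Close(fd)

    buf := alignedBuf(1 << 20) // read in 1 MiB aligned chunks
    var total int64
    for {
        n, err := syscall.Read(fd, buf)
        if n > 0 {
            total += int64(n) // data lands in user space without touching the page cache
        }
        if n == 0 || err != nil {
            break
        }
    }
    fmt.Println("bytes read:", total)
}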

Continue reading

Posted in Technology | Tagged , , , | Leave a comment

USENIX FAST’16 Comments (part II)

Continued from the previous post…

WiscKey: Separating Keys from Values in SSD-Conscious Storage

This one is very interesting. Classic LSM-tree-based key-value stores batch I/Os to take advantage of HDDs’ sequential throughput. To enable efficient lookups, LSM-trees continuously read, sort, and rewrite key-value pairs in the background, resulting in I/O amplification. LevelDB and RocksDB are two popular key-value stores built on top of LSM-trees.

[Figure: LSM-tree structure as used in LevelDB]

Directly applying them to SSDs is not a good idea, because

  • The performance gap between random and sequential I/O is not as large as it is on HDDs.
  • SSDs have a large degree of internal parallelism.
  • SSDs can wear out through repeated writes.

WiscKey’s central idea is to separate keys and values.

  • Only keys are kept sorted in the LSM-tree, while values are stored separately in a log. As a result, key sorting and garbage collection are decoupled.
  • Since values are unsorted, range queries issue parallel random reads to exploit SSDs’ internal parallelism (a toy sketch of this layout follows the figure below).

[Figure: WiscKey data layout]
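
To make this layout concrete, here is a toy sketch of key/value separation (my own illustration, not the authors’ code): values are appended to a value log, and only small (key, offset, length) pointers go into the sorted key structure, for which a plain map stands in below.

package main

import (
    "encoding/binary"
    "fmt"
    "os"
)

type valuePtr struct {
    offset int64  // where the value bytes start in the value log
    length uint32 // value size in bytes
}

// kvStore keeps only key -> pointer entries in memory; a plain map stands in
// for the sorted LSM-tree of keys that WiscKey actually uses.
type kvStore struct {
    vlog  *os.File
    index map[string]valuePtr
    end   int64 // current end offset of the value log
}

func openStore(path string) (*kvStore, error) {
    f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE|os.O_APPEND, 0644)
    if err != nil {
        return nil, err
    }
    st, err := f.Stat()
    if err != nil {
        return nil, err
    }
    return &kvStore{vlog: f, index: map[string]valuePtr{}, end: st.Size()}, nil
}

func (s *kvStore) put(key string, value []byte) error {
    // A 4-byte length header lets the log be scanned for recovery and GC.
    var hdr [4]byte
    binary.LittleEndian.PutUint32(hdr[:], uint32(len(value)))
    if _, err := s.vlog.Write(append(hdr[:], value...)); err != nil {
        return err
    }
    s.index[key] = valuePtr{offset: s.end + 4, length: uint32(len(value))}
    s.end += int64(4 + len(value))
    return nil
}

func (s *kvStore) get(key string) ([]byte, error) {
    p, ok := s.index[key]
    if !ok {
        return nil, fmt.Errorf("not found: %s", key)
    }
    buf := make([]byte, p.length)
    _, err := s.vlog.ReadAt(buf, p.offset) // a single random read into the log
    return buf, err
}

func main() {
    s, err := openStore("vlog.dat")
    if err != nil {
        panic(err)
    }
    defer s.vlog.Close()
    if err := s.put("photo:42", []byte("large value bytes live in the log")); err != nil {
        panic(err)
    }
    v, err := s.get("photo:42")
    fmt.Println(string(v), err)
}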

WiscKey has been shown to be up to 100x faster than LevelDB in microbenchmarks. It is also faster than both LevelDB and RocksDB on YCSB workloads.

Due to limited time (mostly my laziness), I’m not going to write an in-depth analysis here. But I think this paper deserves careful reading, and I refer interested readers to the original paper for the many technical details and discussions.

Estimating Unseen Deduplication – from Theory to Practice

Okay, dedup. Deduplication is important for storage systems built for large datasets; it is well studied, and yet still difficult. (When talking about dedup, Yale alumnus Kai Li and his Data Domain Inc. always pop into my mind :).)

How to deduplicate data is one aspect of the topic, but today we’re talking about another: how to estimate the deduplication ratio before the actual dedup operation is carried out. This is useful because it informs decisions early in the planning phase. A misplanned hardware/software platform may cost far more to grow in capacity, or leave far more unused resources to waste. A good estimate of the dedup ratio is therefore an important input to the initial system design.
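
For reference, the exact (full-scan) number is easy to define: chunk the data, hash each chunk, and divide unique chunks by total chunks. The sketch below is my own naive baseline with fixed 4 KiB chunks and a hypothetical input file; the paper’s whole point is to estimate this ratio from a small sample instead of scanning the full dataset.

package main

import (
    "crypto/sha256"
    "fmt"
    "io"
    "os"
)

// dedupRatio scans r in fixed-size chunks and returns unique/total chunks.
// A lower ratio means more duplication (more savings from dedup).
func dedupRatio(r io.Reader) (float64, error) {
    const chunkSize = 4 << 10 // 4 KiB fixed-size chunking, for illustration
    seen := make(map[[32]byte]struct{})
    buf := make([]byte, chunkSize)
    total := 0
    for {
        n, err := io.ReadFull(r, buf)
        if n > 0 {
            total++
            seen[sha256.Sum256(buf[:n])] = struct{}{}
        }
        if err == io.EOF || err == io.ErrUnexpectedEOF {
            break
        }
        if err != nil {
            return 0, err
        }
    }
    if total == 0 {
        return 1, nil
    }
    return float64(len(seen)) / float64(total), nil
}

func main() {
    f, err := os.Open("dataset.bin") // hypothetical input file
    if err != nil {
        panic(err)
    }
    defer f.Close()
    ratio, err := dedupRatio(f)
    if err != nil {
        panic(err)
    }
    fmt.Printf("dedup ratio (unique/total chunks): %.3f\n", ratio)
}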

Continue reading

Posted in Technology | Tagged | Leave a comment

USENIX FAST’16 Comments (part I)

It has been a few weeks since the Upthere Bob team attended FAST’16. A lot of interesting work was presented at the conference. In a series of posts, I will briefly discuss some of it along with my personal comments. Interested readers are referred to the technical papers for greater coverage and detail.

Google wants new disks for data centers

Eric Brewer (the CAP theorem’s Brewer) gave a great keynote talk on hard drives for data centers. He had a companion blog post, together with a white paper, to supplement his talk. To summarize:

  • Storage requirements in data centers grow exponentially over the years. For example, YouTube users upload over 400 hours of video every minute, which at one gigabyte per hour requires a petabyte of new storage every day.
    [Figure: YouTube storage growth]
  • SSDs are great, but they cannot totally replace HDDs yet because HDDs still offer higher capacity/$. Interestingly, the growth rates in capacity/$ for SSDs and HDDs are relatively close (so SSDs are not catching up in that respect).
  • In typical Google fashion for organizing distributed components, drives in the data centers are treated as one collection rather than as individuals. As a result, Google can set an aggregate target and use linear programming to optimize purchases across vendors and models, with the solution space expressed in $/GB and IOPS/GB (combining drives along the bottom of the convex hull)!
    [Figure: convex hull of drive options in $/GB vs. IOPS/GB]
  • Read tail latency is one of the most important design considerations, as it is much harder to guarantee than write latency (we can do many tricks to hide write latency). One approach to reducing read tail latency is to send multiple requests and cancel the rest once the first one finishes (see the hedged-read sketch after this list). This suggests that each hard drive should retry without too much effort and fail quickly if it cannot succeed, since another drive will likely succeed anyway; we can always come back and ask it to retry really hard if necessary.
  • Perhaps we should rethink the form factor of hard drives. The 3.5-inch form factor was chosen mainly for backward compatibility with legacy (optical) drives. However, we might benefit a lot from moving away from it and putting several disks (of a different form factor) into one box (not a NAS box). I’ll leave the readers to find the details in the white paper.
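
As a small illustration of the hedging idea mentioned above (my own sketch, not anything from the talk or white paper), here is how one might race the same read against several replicas and cancel the losers. The replica names and simulated latencies are made up.

package main

import (
    "context"
    "fmt"
    "math/rand"
    "time"
)

// readReplica simulates reading a block from one replica; the latency here
// is random purely for illustration.
func readReplica(ctx context.Context, replica string) ([]byte, error) {
    delay := time.Duration(rand.Intn(100)) * time.Millisecond
    select {
    case <-time.After(delay):
        return []byte("block data from " + replica), nil
    case <-ctx.Done():
        return nil, ctx.Err()
    }
}

// hedgedRead issues the same read to every replica and returns the first
// response, cancelling the remaining requests.
func hedgedRead(replicas []string) ([]byte, error) {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel() // cancels the losers once the first result arrives

    type result struct {
        data []byte
        err  error
    }
    ch := make(chan result, len(replicas)) // buffered so losers never block
    for _, r := range replicas {
        go func(r string) {
            d, err := readReplica(ctx, r)
            ch <- result{d, err}
        }(r)
    }
    res := <-ch // first response wins
    return res.data, res.err
}

func main() {
    data, err := hedgedRead([]string{"disk-a", "disk-b", "disk-c"})
    fmt.Println(string(data), err)
}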

Continue reading

Posted in Technology | Tagged | 2 Comments

Merry Christmas!

When Ryan found his Christmas gift this morning…

Posted in Photography | Tagged , | Leave a comment

Go Upthere

It has been quite a while since I last updated my posts. I have been very busy…

Together with many talented team members at Upthere, I have been working extremely hard on revolutionizing the future of the cloud. We recently came out of stealth mode and announced our first two products. You should check out the beautifully designed website at www.upthere.com to learn our story.

Upthere was co-founded by Bertrand Serlet, Roger Bodamer, and Alex Kushnir. I’m not going to repeat the many interesting stories about them, but you can definitely check out this detailed article by TechCrunch. They have another article introducing the apps we currently offer. We are in beta now, and the service is totally free at this stage. You’re welcome to join us and give us valuable feedback. Go Upthere!

When I get time, I will start to write some new posts to discuss the technical challenges we are facing and how we are addressing them. Stay tuned :).

Sign up here to start exploring the new journey for all your digital assets.

Posted in Technology | Tagged , | Leave a comment

Hardware Aware Algorithms (part II)

In the previous post, we briefly discussed the importance of hardware awareness in algorithm design with two naive examples. This time, we will look at a less naive parallel programming example. I will show that, depending on cache friendliness, two algorithms can perform very differently even with exactly the same complexity.

Imagine we have an array and want to compute some pairwise statistic. To make it dumb, let’s assume the pairwise statistic is simply the sum of two items. To make it even dumber, assume we are only interested in the total of these sums. A straightforward single-threaded algorithm works as follows. Note that I deliberately make the inner loop start from 0 instead of i+1, and don’t reuse the loop to do something like s += 2*(a[i] + a[j]); this saves some effort in careful load balancing for the parallel version I’m going to discuss.

// a is the input array of length nn; s accumulates the total of pairwise sums.
for i := 0; i < nn; i++ {
    for j := 0; j < nn; j++ {
        s += a[i] + a[j]
    }
}

To make the algorithm parallel, probably the simplest approach is to parallelize the outer loop. In the following code, we use a Go channel to allocate computing resources; in Java, one could use a thread pool for the same purpose. The real logic is implemented as the closure g.
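
Since the post’s own code continues after the cut, here is a sketch of what such a parallel version could look like: a buffered channel acts as a semaphore capping the number of in-flight goroutines, and the per-row work is the closure g (the name follows the text above). The details are my own illustration, not the original code.

package main

import (
    "fmt"
    "runtime"
    "sync"
)

func main() {
    const nn = 1 << 12
    a := make([]float64, nn)
    for i := range a {
        a[i] = float64(i)
    }

    sem := make(chan struct{}, runtime.NumCPU()) // caps concurrent workers
    partial := make([]float64, nn)               // one slot per outer index, so no shared accumulator
    var wg sync.WaitGroup

    // g runs the inner loop for a single value of i.
    g := func(i int) {
        defer wg.Done()
        defer func() { <-sem }() // release the slot
        var local float64
        for j := 0; j < nn; j++ {
            local += a[i] + a[j]
        }
        partial[i] = local
    }

    for i := 0; i < nn; i++ {
        wg.Add(1)
        sem <- struct{}{} // acquire a slot before spawning
        go g(i)
    }
    wg.Wait()

    var s float64
    for _, p := range partial {
        s += p
    }
    fmt.Println("total:", s)
}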

Continue reading

Posted in Algorithm | Tagged , , , , , , | Leave a comment

Hardware Aware Algorithms (part I)

Modern computer architectures have many powerful hardware features that help mask various latencies in the system and ease software design. The memory/cache hierarchy is a good example. For a software developer, understanding some hardware fundamentals is often very helpful for taking advantage of those features and writing highly efficient programs. Unfortunately, many algorithm courses at school focus largely on computational complexity and touch little or not at all on hardware-aware design. In this post, I’d like to use some very simple examples to demonstrate its importance.

Note: As I’m picking up Go as a new tool in my programming language toolbox, I’ll write Go programs below. However, the results shown should be totally independent of the language choice. One can try C or Java to get similar results.

Most programmers have probably seen the following example of traversing a 2D array.

// scanner over an m×m 2D array; m is a package-level constant defined elsewhere.
type scanner interface {
    compute(in *[m][m]float64) float64
}

// horizontal order.
type horizontal struct{}

func (p horizontal) compute(in *[m][m]float64) float64 {
    var sum float64
    for i := 0; i < m; i++ {
        for j := 0; j < m; j++ {
            sum += in[i][j]
        }
    }
    return sum
}

// vertical order.
type vertical struct{}

func (p vertical) compute(in *[m][m]float64) float64 {
    var sum float64
    for i := 0; i < m; i++ {
        for j := 0; j < m; j++ {
            sum += in[j][i]
        }
    }
    return sum
}

Although both algorithms have exactly the same complexity, the row-based or horizontal scan performs much better than the column-based or vertical scan. In my test, the former averages about 1.6 seconds, while the latter averages about 29 seconds.
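
For reference, timings like these can be collected with a small harness along the following lines. This is a sketch rather than the post’s actual benchmark: it assumes a package-level constant m (the post does not state the value used, so something like const m = 4096 is an assumption) and requires the fmt and time imports.

// benchmark times both scanners on the same input.
func benchmark() {
    in := new([m][m]float64) // heap-allocated so a large m does not blow the stack
    for i := 0; i < m; i++ {
        for j := 0; j < m; j++ {
            in[i][j] = 1
        }
    }
    for _, s := range []scanner{horizontal{}, vertical{}} {
        start := time.Now()
        sum := s.compute(in)
        fmt.Printf("%T: sum=%v, elapsed=%v\n", s, sum, time.Since(start))
    }
}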

Continue reading

Posted in Algorithm | Tagged , , , , , | 1 Comment