Streaming Analytics for NFS, Product Design and My Involvement in 3 Startups

CacheIQ RapidView NFS Operations and Network Throughput Drilldown


At Boundary I work on a variety of services that collect and analyze application flow data. This information is aggregated and presented to users via our streaming analytics platform. This post isn’t about Boundary, though, and Boundary isn’t my first foray into streaming analytics. It’s a problem I’ve been thinking about and working on for a while.

Starting Aceevo

Back in 2010 I started Aceevo. One of the goals with Aceevo was to accelerate front-end web application development while bringing a more rigorous engineering discipline to the practice. As a developer I like strong typing. It’s just what I prefer, and it is probably more a function of my age, education and initial training in programming than anything else. The first language I learned was C and I am very comfortable with terse, low-level languages.

In the early 2000s I made the move to developing web applications in Java. I was conflicted about it at first, but I quickly found myself enjoying working in that environment. One of the things that has always frustrated me about web application development, though (not to mention the issue of reusability), was the lack of safety. I could build robust, testable, type-safe systems on the backend, and it would all come tumbling down to dreaded strings once the data left the application server en route to the browser.

Now it doesn’t bug me too much; I’ve always been pretty good at JavaScript. It’s more the disconnect between the DOM, the data (our specific application data model) and the browser event model. Having also done thick client development in GTK and Swing, I sometimes wished for the power of rich client development in the browser.

Around 2008 I became fascinated with GWT, and later in 2010 with Vaadin, and that’s how Aceevo was born. In addition to providing traditional backend Java consulting services, I used GWT and Vaadin as front-end libraries to write powerful, rich frameworks for rapidly developing the client-side components of web applications. I had the tools to bring true OOD and DDD to web development, and type safety was enforced by a compiler.

In 2010 I was approached by the 4 founding engineers and a co-founder of CacheIQ to lead the design and implementation of their streaming analytics platform for NFS. In May of 2011, CacheIQ became my primary customer but before we talk about that I’ve got to tell you about Storspeed.

Good ol’ Storspeed

Storspeed System Diagram


Storspeed was founded in 2007. I joined in late 2008, officially started my employment in January of 2009, and we came out of stealth 10 months later. Storspeed was initially a hardware-driven company. They raised $13 million in funding and set out to hire, design and build the platform for a product that hadn’t been built before in a market that didn’t yet exist: Application Aware Caching.

Application Aware Product Sheet, complete with “What is Application Aware Caching?” and “Superior Visibility See What You Are Missing”. Also I’m pretty sure those are F-18s flying in diamond formation under “Performance”

I’ve worked for a few hardware-driven companies before and I can tell you the Storspeed hardware guys were pretty good. They delivered on time and under budget. There was only one problem: most of what we were doing was software.

It’s not a new story. It was a classic case of disconnect between hardware and software. The hardware was extremely powerful (and expensive) and was designed for wire-speed network processing. The Storspeed SP5000 had 2 RMI/Raza XLR multicore CPUs, which are traditionally used in routers, switches and firewalls.

Each processor had 8 cores with 4 hardware threads per core. 7 of the cores were allocated to the data plane and ran RMI XOS, a lightweight RTOS that provided a threading model for issuing work to the processor, and 1 core ran Linux for the management plane. There were a bunch of other goodies in the SP5000, including FPGAs and Fulcrum Ethernet switches to switch the NFS traffic from the CX4 interfaces to the processors.

The SP5000s also had a lot of RAM and SSDs. They came in 3 and 6 node clusters and were fronted by 2 Ethernet switches called Flow Directors, each equipped with Fulcrum chips that ran custom code which tagged and switched packets to provide consistent and balanced hashing of NFS traffic across the cluster.

In addition, a management application called “Sysman” ran on a Dell 2960 with 8 drives and several CPUs. Sysman provided all aspects of FCAPS as well as NFS analytics for the SP5000 and Flow Director cluster. Basically, the Storspeed solution could take up an entire rack in your data center.

A Million IOPS and Other Simple Product Requirements

The overarching goal for the product was always throughput: one million IOPS. I imagine the sales pitch went like this.

“SSDs are expensive and you’ve made an investment in spinning disks, spinning disks are slow, we make them go fast.”

Additionally, though, Storspeed sought to develop a turnkey product that is essentially still unique to this day. They really tried to knock it out of the park. The Storspeed product was an application-aware, policy-driven cache that could front large NetApp, BlueArc and EMC/Isilon NAS arrays with spinning disks and make them faster.

Ultimately it was a cache, and the function of a cache is pretty simple. But Storspeed’s product was unique in 4 ways.

1.) Seamless NFS Client Integration

Most caches require reconfiguration of NFS client mount points: the client mounts the cache and the cache mounts the filesystem on the remote NAS. Storspeed’s Ethernet switches instead intercepted the NFS traffic and switched it to the destination NAS through the Storspeed cluster. The cluster fronted the NAS filers and proxied all requests, creating a seamless shim for the client, i.e. the client always thought it was talking to the NAS.

2.) Fault Tolerant

The Flow Directors’ management plane coordinated with both the SP5000 cluster nodes and Sysman to determine if the cluster was properly passing traffic. In the event of a data plane failure, the Flow Directors would shut down the LAGs to the cluster and switch NFS traffic around failed nodes so clients could continue to access the NAS.

3.) Policy Driven

You could configure the caching algorithms on the Storspeed cluster by defining policies for cache retention and cache eviction based on ingress and destination IP address or file type being accessed (not exactly application aware but you have to start somewhere).

4.) NFS Traffic Analytics

Sysman was supposed to provide low-level statistics and analytics about NFS traffic. In reality it provided one-minute aggregations of NFS traffic statistics broken down on various dimensions.

Let’s Go Crazy

In order to meet these product requirements a lot of cool stuff (or not so cool, depending on your outlook) happened on the software side.

Due to decisions that were made in the hardware design, the “software guys” (network programmers/kernel developers) were doing things like writing their own scheduler because we didn’t have a real OS. And because we were required to build 3 and 6 node clusters, we had to implement our own Paxos.

We also attempted to sell this product to the largest movie and gaming studios as well as leaders in the energy industry.

Not So Streamy NFS Analytics

Sysman – I know it says CacheIQ but this is what the Storspeed UI looked like, trust me, the ugly it burns.


The original 2 engineers who worked on Sysman decided to leave the company so another engineer and I took over the job of designing and building the system to monitor and manage the entire Storspeed product infrastructure.

We first looked at the underlying technologies Sysman used. Many fit the traditional enterprise Java stack (Spring, Hibernate, JDBC, PostgreSQL) but other aspects were totally wacky. Sysman was an OSGi/Eclipse RCP application that ran in the Eclipse Jetty plugin.

Sysman used Thrift to communicate with a process called LM (local manager) on the SP5000 cluster to receive NFS traffic statistics. LM kept a circular buffer of NFS Request Stats in shared memory and would send stats every 10 seconds or would flush if the buffer was about to wrap.
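To make that flush behavior concrete, here is a minimal sketch of a ring buffer that hands off its contents either on a timer or just before it would overwrite unsent records. This is not LM's actual code: the class name `StatsRing`, the `send` callback (standing in for the Thrift call) and the injectable `clock` are all assumptions for illustration.

```python
import time


class StatsRing:
    """Fixed-size ring of stats records (hypothetical stand-in for LM's buffer)."""

    def __init__(self, capacity, flush_interval, send, clock=time.monotonic):
        self._slots = [None] * capacity
        self._start = 0                  # index of the oldest unsent record
        self._count = 0                  # number of unsent records
        self._flush_interval = flush_interval
        self._send = send                # assumed transport callback
        self._clock = clock
        self._last_flush = clock()

    def record(self, stat):
        # About to wrap: flush before overwriting the oldest unsent record.
        if self._count == len(self._slots):
            self.flush()
        index = (self._start + self._count) % len(self._slots)
        self._slots[index] = stat
        self._count += 1
        # Time-based flush (the post mentions every 10 seconds).
        if self._clock() - self._last_flush >= self._flush_interval:
            self.flush()

    def flush(self):
        if self._count:
            size = len(self._slots)
            batch = [self._slots[(self._start + i) % size]
                     for i in range(self._count)]
            self._send(batch)
            self._start = (self._start + self._count) % size
            self._count = 0
        self._last_flush = self._clock()
```

In this sketch the interval is only checked when a record arrives; a real daemon would presumably also flush from a timer so that a quiet buffer still gets sent.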

In OLAP terms our cube looked something like this.

dimensions
– epoch
– source IP (NFS client)
– dest IP (Filer)
– file id
– policy id

measurements
– read hits (ops)
– read misses (ops)
– write hits (ops)
– write misses (ops)
– meta hits (ops)
– meta misses (ops)
– read mb hits
– read mb misses
– write mb hits
– write mb misses

Stats were written to a raw stats table in PostgreSQL and jobs would run every minute to aggregate the raw stats by dimension and write the aggregated output to different tables. The business logic for aggregation was done in SQL.
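The original aggregation lived in SQL, but the shape of the job is easy to sketch in code. The following is a rough illustration only: the field names (`epoch`, `source_ip`, `read_hits` and so on) and the `aggregate` function are assumptions modelled on the cube above, not Sysman's schema.

```python
from collections import defaultdict

# A subset of the cube's measurements; names are illustrative.
MEASURES = ("read_hits", "read_misses", "write_hits", "write_misses")


def aggregate(raw_stats, dimension, bucket_seconds=60):
    """Sum each measure per (time bucket, dimension value)."""
    totals = defaultdict(lambda: dict.fromkeys(MEASURES, 0))
    for row in raw_stats:
        # Truncate the timestamp to the start of its bucket.
        bucket = row["epoch"] - row["epoch"] % bucket_seconds
        key = (bucket, row[dimension])
        for measure in MEASURES:
            totals[key][measure] += row.get(measure, 0)
    return dict(totals)
```

Note that the result holds one entry per distinct dimension value per bucket, so its size grows directly with the number of clients or files seen in each minute.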

Needless to say, this design fell over pretty quickly as soon as we hit a customer with a large number of clients or files, due to the cardinality of those dimensions.

The product itself failed to provide much insight and the usability and UX of the product were poor. In fact, you had to manually pull the scrubber over to the leading edge of the graph every minute if you wanted to see the UI update. Complete bummer.

On a more positive note the code was well designed, maintainable and easily extensible. They had built a solid framework for building applications and on the UI side the GWT implementation addressed many of the fundamental underlying problems with rich client web application development.

As an aside, though, the UI code wasn’t the sort of stuff you could give to a JavaScript developer and expect them to run with; it was written in Java. There was no way to just bring in a talented UI developer to help with visualization given that the majority of the code was in GWT.

I knew that customers needed near real-time visibility of NFS application flow data to determine what the current state of the system was at any point in time. Unfortunately the backend wasn’t designed for this and the previous engineers chose to implement the critical visual analytics pieces of the product in ActionScript and BIRT.

In March of 2010 Storspeed abruptly halted operations and joined the dead pool.

CacheIQ

CacheIQ RapidView Cluster Dashboard

In mid 2010 CacheIQ was formed as a reboot of Storspeed and I was approached again to lead the development of the management and analytics platform. The business requirements were the exact same but the software team had shrunk from about 15 to 4 engineers.

The hardware was also quite different. CacheIQ’s RapidCache product ran FreeBSD and used 6 Nehalem Intel processors and two 20Gb full-duplex Intel Ethernet NICs LAG’d together to send upwards of 40Gb/sec across the backplane of a single 2U box with a lot of RAM and SSDs.

I was initially very hesitant having decided to start Aceevo and pursue different goals, but I picked up a bit of work with CacheIQ part-time on the weekends helping them port the existing Sysman application over to the new platform.

Unfinished Business

As an engineer the allure of finishing what I had started was too great and by May of 2011 I had wrapped up Aceevo’s other full-time engagements and signed a consulting agreement to focus on CacheIQ’s streaming NFS analytics platform full-time.

With me on board that made 5 engineers, and each of us was working 60-80 hours a week. On a side note, it is amazing what you can accomplish with a small group of humble engineers who actually care about each other. We spend time together outside of work, we know each other’s spouses and families and, having worked together long enough, we understand how each other thinks.

Throwing It All Away and Starting Over

We learned a lot at Storspeed and knew the underlying technology didn’t facilitate the development of a near real-time analytics platform. I obtained buy-in from the founders and engineering team to create a whole new system, so I began investigating and selecting technologies.

We started with the premise that 5 second intervals would be our lowest granularity for aggregations and streaming updates.

I wanted to run on the JVM and started looking for a lightweight container and framework to help us rapidly build RESTful web services. After some investigation we selected the Play! framework.

We kept Thrift and added ZooKeeper for distributed coordination. Almost all our data was key/value timeseries bits of JSON, so we replaced PostgreSQL with MongoDB and decided to move the aggregation business logic into code.

On the UI side GWT and ActionScript had to go. I like GWT for certain projects but it’s not suited for data visualization, so we selected Canvas and made use of the jQuery Flot plugin as well as Protovis and D3.

Selecting the technology and implementing was the easy part.

Some Difficult Questions

We knew there was value in providing NFS streaming analytics but no one at Storspeed or CacheIQ could say exactly how it should be done or why other than “It’s valuable to the customer”.

Potential customers did get excited about enhanced visibility into NFS flow data and several considered purchasing simply for the analytics but we hadn’t really defined the use case for tying NFS analytics directly back into the Storspeed or CacheIQ product.

I have some experience building monitoring and management applications and actually started my career in network engineering so contrary to what some people think, I understand that IT/Ops and DevOps people have a life. A life with partners and kids and bosses and all the things that come with a life, they need easy access to actionable information about the current state of things as they’re constantly context switching and fighting fires.

But how does all that translate into building a product? I mean I couldn’t just throw up a bunch of metrics on the screen, well I could, but that wouldn’t really solve a problem right? And what was the problem exactly I was trying to solve by providing all this data? No one had really answered that either.

I didn’t feel like I could even begin to properly build this product until I had those answers, but it was my problem to solve.

Space to Think

Photo of me taken by @pearce_barry at CacheIQ, a very happy and productive time in my life. RapidView starting to come together in the background.

I spent a lot of time thinking about this problem and I was given a lot of room to think about it. I don’t just mean that metaphorically either. CacheIQ moved into a new building in Austin and I had my own office with a door, where I could close myself off from the rest of the world and spend hours considering what the point was of collecting and aggregating 40Gb/sec worth of NFS traffic statistics.

I know it’s fashionable for startups to put everyone together in a single room to create some sort of “agile” or “XP” environment but programming is and will remain a solitary pursuit and generating a ton of noise and distractions is counterproductive to creating product.

After a lot of thought I found my answer and in hindsight it seems so blatantly obvious. Eventually I realized the question that RapidView had to answer was pretty simple.

How is my Cache Performing?

Now there are some implications to this and you can further extrapolate that to how are my clients being served, what load is being taken off my filers etc, but ultimately it’s answered by showing breakdowns of how my cache is performing. It’s important that you can say in maybe one or two sentences exactly what your product does and why it’s valuable, if not, it’s really difficult to build something cohesive.

With this question (or answer, depending on how you look at it) I was able to start to delve into the creative aspects of designing the product. I realized that an effective way to show how the cache was performing was to show totals (throughput or ops) and then use stacked area charts to show ratios of the cache vs. filer across dimensions. Iterating on this I was able to start to envision the product.
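As a rough sketch of the numbers behind such a chart, the function below turns hit and miss counts into a total plus the two stacked shares. It assumes that hits were served by the cache and misses fell through to the filer; the function name and the output keys are made up for illustration and are not RapidView's API.

```python
def cache_vs_filer(hits, misses):
    """Return the total and the cache/filer shares for one data point.

    Assumption: a hit was served by the cache, a miss went to the filer.
    """
    total = hits + misses
    if total == 0:
        # No traffic in this interval: avoid dividing by zero.
        return {"total": 0, "cache": 0.0, "filer": 0.0}
    return {"total": total, "cache": hits / total, "filer": misses / total}


def stacked_series(points):
    """points: iterable of (epoch, hits, misses) tuples, one per interval."""
    return [(epoch, cache_vs_filer(hits, misses))
            for epoch, hits, misses in points]
```

The same calculation works for ops or for megabytes, and can be repeated per client, filer or policy to get the per-dimension breakdowns.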

Wireframes and Implementations for Cluster View

Wireframes and Implementations for Cluster Drill Down Views

Now for the Fun Stuff

With these questions answered I was able to dive into the fun and interesting technical challenges of getting the system to scale, perform and behave the way I envisioned. One of the really cool challenges was supporting various levels of granularity or zoom on the fly.

Most analytics applications have fixed aggregations: 1 minute, 1 hour, etc. With RapidView I had 5 second, 1 minute and 10 minute aggregate data, and I used this data to further aggregate records on the fly on the server as you zoomed in and out in the UI.
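One way to picture the on-the-fly rollup: pick the coarsest stored granularity that divides evenly into the interval the UI asks for, then sum those records into the requested buckets. The post does not say how RapidView actually chose its source data, so the selection rule and the names `pick_source` and `rollup` below are assumptions.

```python
# Stored granularities in seconds: 5 second, 1 minute, 10 minute.
STORED = (5, 60, 600)


def pick_source(target_seconds):
    """Coarsest stored granularity that evenly divides the target interval."""
    candidates = [g for g in STORED
                  if g <= target_seconds and target_seconds % g == 0]
    if not candidates:
        raise ValueError(f"no stored granularity fits {target_seconds}s")
    return max(candidates)


def rollup(points, target_seconds):
    """Sum (epoch, value) points into buckets of target_seconds."""
    buckets = {}
    for epoch, value in points:
        start = epoch - epoch % target_seconds
        buckets[start] = buckets.get(start, 0) + value
    return sorted(buckets.items())
```

For example, a 5 minute zoom level would read the 1 minute records and add up five of them per point, rather than scanning sixty 5 second records.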

Wrapping Up

Eventually CacheIQ met and exceeded their goal of one million IOPS and maxed out the network at over 40Gb/sec of NFS traffic. In early 2012 two other engineers came on board and helped me complete RapidView, which eventually replaced Sysman. In late 2012 CacheIQ was purchased by NetApp.
