Enormous volumes of data are only going to increase.
Broadband companies are already working with enormous volumes of data. Moving forward, those volumes are only going to increase, and to deal with them, service providers will need a more thorough understanding of how to handle “Big Data.”
With the dramatic increase in the consumption of Internet-based rich media content, the broadband industry is gradually making the transition to usage-based services in order to improve the subscriber experience, offer a wider range of revenue-enhancing services and better align service delivery costs with actual usage.
Deploying these new services requires the collection, processing and storage of vast amounts of detailed subscriber, network and application information. The sheer volume of such data is overwhelming the capabilities of traditional information processing and storage systems, particularly SQL-based relational database management systems that weren’t designed to handle hundreds of terabytes or petabytes of data.
Fortunately, companies such as Amazon, Facebook and Google have pioneered the adoption of Big Data technologies to manage the massive amounts of information they collect from hundreds of millions of users. These same technologies are being applied successfully by broadband providers to collect, store, analyze and utilize subscriber usage data. Big Data technologies can provide real-time, subscriber-centric visibility into network usage, network condition and subscriber experience, ushering in a new era of dynamic broadband service management.
Relational database shortcomings
Relational databases are designed for a wide range of queries and accomplish this by storing metadata along with the actual data. The metadata is organized to facilitate flexible extraction of data and needs to be written to disk as data is stored. In some cases, there may be 200 to 300 bytes of metadata for every 100 bytes of data, and for massive volumes of data, this can double or triple the amount of storage required. More critically, writing the metadata to disk negatively impacts performance.
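The storage impact of that overhead is easy to sanity-check with back-of-the-envelope arithmetic. The 2x-3x metadata ratio is the figure cited above; the 100 TB starting point is a hypothetical:

```python
# Rough arithmetic for relational metadata overhead, using the article's
# figure of 200-300 bytes of metadata per 100 bytes of data (2x-3x the data).
data_bytes = 100 * 10**12                # hypothetical 100 TB of raw data
overhead_low, overhead_high = 2.0, 3.0   # metadata as a multiple of data

total_low = data_bytes * (1 + overhead_low)    # data + 2x metadata
total_high = data_bytes * (1 + overhead_high)  # data + 3x metadata

print(f"{total_low / 10**12:.0f}-{total_high / 10**12:.0f} TB total")
# 100 TB of data can require 300-400 TB once metadata is included
```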
Big Data technologies were developed to address these and other shortcomings of traditional relational databases. Typical database software tools are insufficient to collect, store, manage and analyze huge datasets ranging from hundreds of terabytes to petabytes.
Two characteristics of Big Data are volume and velocity. Volume refers to the massive amount of data that needs to be stored and analyzed, while velocity indicates that data typically flows into the data management system rapidly and needs to be analyzed quickly and frequently.
In 2008, Facebook open-sourced the technology now managed by the Apache Cassandra Project and developed by many contributors from around the world. Cassandra is a “NoSQL” database – a highly scalable, distributed data management system that scales linearly with the size of the dataset and delivers proven high availability on commodity hardware infrastructure. Cassandra employs a structured key-value data model that enables read and write throughput to increase linearly as the dataset grows and new machines are added. Cassandra also includes its own query language, CQL (Cassandra Query Language).
The advantages of a Big Data system like Cassandra include:
- Greatly improved write performance, because the system does not write metadata to disk
- Highly optimized read performance due to the key-value storage model, in which keys remain sorted on disk to facilitate contiguous reads of data
- Incremental, “pay as you grow” linear system scaling
- Use of commodity processing and storage hardware
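The sorted key-value idea behind that read optimization can be illustrated with a minimal in-memory sketch. This is not Cassandra’s actual storage engine, and the key format is invented for the example; it only shows why keeping keys sorted makes range reads over adjacent keys contiguous:

```python
import bisect

class SortedKVStore:
    """Toy sorted key-value store: keys stay sorted, so a range of
    adjacent keys can be read in one contiguous pass."""
    def __init__(self):
        self._keys = []    # kept in sorted order
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)   # insert while preserving order
        self._values[key] = value

    def range_read(self, lo, hi):
        """Return (key, value) pairs for lo <= key < hi, in key order."""
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_left(self._keys, hi)
        return [(k, self._values[k]) for k in self._keys[i:j]]

store = SortedKVStore()
store.put("modem:0042|2013-01-01T10:00", {"down_mb": 120})
store.put("modem:0042|2013-01-01T10:15", {"down_mb": 95})
store.put("modem:0099|2013-01-01T10:00", {"down_mb": 10})

# One modem's records for the hour come back as a contiguous slice:
rows = store.range_read("modem:0042|", "modem:0042|~")
```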
Opportunities and challenges
What types of Big Data challenges do broadband providers face, and how are they leveraging this new technology?
One critical need is accurately collecting and recording detailed subscriber usage and network performance data. The DOCSIS standard specifies the IP Detail Record (IPDR) protocol as a standard mechanism to collect network and subscriber usage and performance data. IPDR is gaining popularity with cable operators as a highly efficient and scalable means of measuring subscriber usage and activity in broadband networks.
A large broadband provider could have millions of subscribers, each being served by their own cable modem. Using IPDR to track usage and performance, each of the cable modems will periodically generate records – approximately every 10 to 20 minutes – in order to provide accurate and timely usage information. This results in millions of IPDRs being collected, processed and stored every hour; tens of millions every day; hundreds of millions every week; and billions over the course of a year. As a result, the data management system must be able to efficiently handle hundreds of terabytes of data.
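The scale of that collection problem can be sanity-checked with simple arithmetic. The subscriber count, reporting interval, and per-record size below are hypothetical:

```python
subscribers = 5_000_000          # hypothetical footprint, one modem each
interval_minutes = 15            # one IPDR roughly every 10-20 minutes
records_per_hour = subscribers * (60 // interval_minutes)
records_per_day = records_per_hour * 24
records_per_year = records_per_day * 365

print(f"{records_per_hour:,} records/hour")   # 20,000,000
print(f"{records_per_day:,} records/day")     # 480,000,000
print(f"{records_per_year:,} records/year")   # 175,200,000,000

# At an assumed ~500 bytes per stored record:
bytes_per_record = 500
tb_per_year = records_per_year * bytes_per_record / 10**12
print(f"~{tb_per_year:.0f} TB/year")          # ~88 TB/year, before replication
```

Multi-year retention, replication, and growth in the subscriber base push the total into the hundreds-of-terabytes range the article describes.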
Broadband providers can use a Big Data storage system like Cassandra to continuously store this massive volume of detailed per-subscriber usage data, which can also be retained over long periods of time, enabling historical analysis of subscriber and network usage trends. This is done economically by using commodity server processing and storage infrastructure that is simply expanded as the volume of data grows.
Applying usage data
Once broadband providers have collected all of this usage data, how can they turn it into useful information to better serve subscribers, grow revenues and manage their business more profitably?
In the early days of broadband deployment, the initial focus was simply bringing subscribers online and increasing the speed of their connections. Now, with hundreds of millions of broadband subscribers worldwide, the focus is shifting to understanding how subscribers are using broadband and more effectively managing the network infrastructure under increasing load and periods of heavy congestion. This is driven by the dramatically increased consumption of rich media content, including bandwidth-intensive applications such as electronic gaming and streaming video. In a very real sense, broadband is becoming a victim of its own success.
In this new era, the first requirement is for broadband providers to be able to analyze network usage for capacity planning purposes. It’s essential they understand exactly how the network is being used so network engineers can make intelligent decisions about where to increase network capacity. In cable networks, many congestion and performance problems are manifested in the local distribution network, so it’s critical to have a fine-grained view of network performance down to the level of individual subscribers. This isn’t possible unless usage information is being continuously collected at that level of granularity. Furthermore, this type of analysis may require examining network performance over relatively long periods of time, which means it’s crucial to be able to record historical usage data.
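The kind of aggregation a capacity planner runs over per-subscriber records can be sketched as follows. The record layout and field names are illustrative, not an IPDR schema:

```python
from collections import defaultdict

# Toy per-subscriber usage records: (segment, subscriber, hour, megabytes).
records = [
    ("node-A", "sub-1", "19:00", 900),
    ("node-A", "sub-2", "19:00", 700),
    ("node-A", "sub-1", "03:00", 50),
    ("node-B", "sub-3", "19:00", 200),
]

# Roll per-subscriber usage up to per-segment, per-hour totals -- the
# fine-grained view needed to see where congestion actually occurs.
usage = defaultdict(int)
for segment, _sub, hour, mb in records:
    usage[(segment, hour)] += mb

busiest = max(usage, key=usage.get)
print(busiest, usage[busiest])   # ('node-A', '19:00') 1600
```

Because the underlying records are per-subscriber, the same data can be re-aggregated later at any granularity, which is exactly what long-term retention makes possible.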
Another requirement for broadband providers is managing network performance during times of heavy load and congestion. By combining per-subscriber policies with network performance data, it’s possible to implement dynamic traffic management controls to throttle traffic flows for a group of subscribers when their segment of the network becomes congested, so that each subscriber gets their fair share of the available bandwidth based on policy information.
Note that subscribers may have different policies based on the type of services they are entitled to under their service agreements. Some subscribers may elect to pay a premium so they are assured better throughput at times when the network is congested. However, for any of this to work, providers must have rapid access to accurate, current network usage data.
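One simple way to combine those per-subscriber policies with congestion data is a weighted fair-share split. The weights and capacity below are hypothetical, and a real DOCSIS scheduler is far more sophisticated; this only sketches the policy idea:

```python
def fair_share(capacity_mbps, subscribers):
    """Split a congested segment's capacity by policy weight.
    `subscribers` maps subscriber id -> policy weight; a premium
    subscriber carries a higher weight during congestion."""
    total_weight = sum(subscribers.values())
    return {sub: capacity_mbps * w / total_weight
            for sub, w in subscribers.items()}

# Hypothetical congested segment: 100 Mbps shared by three subscribers,
# one of whom pays a premium for a 2x weight at congested times.
shares = fair_share(100, {"sub-1": 1, "sub-2": 1, "sub-3": 2})
print(shares)   # {'sub-1': 25.0, 'sub-2': 25.0, 'sub-3': 50.0}
```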
Services for all
While there is a clear trend toward increased consumption of rich media content, there is also a growing broadband usage disparity between typical users and power users. Existing flat-rate broadband billing models assume all subscribers will use some reasonable amount of network capacity per month, but this is becoming problematic as heavy users push the limits of “reasonable” use and are viewed as miscreants for exceeding these limits.
Furthermore, light users of broadband are effectively subsidizing the heavy users, and under the flat-rate billing model they have no option to pay only for what they actually use. This is driving broadband providers to offer usage-based services, in which subscribers pay for what they actually use; deploying these services requires collecting accurate and timely per-subscriber usage data.
Product managers at broadband providers can use business intelligence tools to analyze network and subscriber usage data to gather the information needed to plan and design new types of service offerings.
This type of analysis typically requires extracting various kinds of historical data from the Big Data system in order to see the actual distribution of broadband consumption patterns by types of subscribers over time. While there may be statistical methods that yield approximate results, having a detailed historical record of actual subscriber usage data will yield better results.
For instance, a broadband provider may see that a large number of power users tend to be active during the evening busy hours and design a new service that offers them assured network performance at those times, but at a premium price. Detailed usage data is also critical for knowing where to draw the line for different tiers of usage-based services. Big Data analysis might show that there are multiple clusters of subscribers centered around certain usage levels, which would then be used to set usage quotas for each tier.
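Finding those usage clusters can be as simple as a one-dimensional k-means pass over monthly usage totals. The usage figures and starting centers below are hypothetical, and real analysis would run inside a BI or analytics tool rather than hand-rolled code:

```python
def kmeans_1d(values, centers, iters=20):
    """Tiny 1-D k-means: repeatedly assign each value to its nearest
    center, then move each center to the mean of its group."""
    for _ in range(iters):
        groups = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(c - v))
            groups[nearest].append(v)
        centers = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return sorted(centers)

# Hypothetical monthly usage (GB) with light, typical, and power users:
usage_gb = [3, 5, 6, 40, 45, 55, 60, 300, 320, 350]
centers = kmeans_1d(usage_gb, [10.0, 100.0, 400.0])
# Tier quotas could then be set between adjacent cluster centers.
```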
As we have seen, there are many possible uses for collecting and storing extensive, historical, subscriber-centric broadband usage data. Using Big Data technologies, broadband providers now have the ability to offer a new generation of dynamic broadband services that better meet the needs of subscribers. They also have the necessary information to make intelligent decisions about network capacity planning and to design new service offerings.