The Eightfold Path to Data Analytics Enlightenment
The Buddhist Eightfold Path offers eight guides to achieving spiritual enlightenment and ending suffering. With due respect to Buddhism, I offer eight paths to an enhanced, enlightened state for your data analytics efforts: a state that eases suffering and provides deeper, more impactful experiences, both today and as data volumes keep growing.
We hear a great deal about big data and the exponential growth in data resulting from social media, email, Twitter, machine logs, and IoT devices. Recent estimates from IBM state that 90 percent of the data in existence today is less than two years old and that we are creating 2.5 quintillion bytes of data per day.
The hard truth is that today’s data analytics systems are struggling just to keep up with their traditional, structured data streams, and they have not yet begun to leverage the exponential growth in semi-structured data.
Traditional batch architectures for structured data analytics, which chain Extract, Transform, and Load into a Data Warehouse and then Business Intelligence (ETL-DW-BI), are overloaded. The resulting legacy issues include:
- Slow data loading and data transformation;
- Multiple data stages from landing to staging to target, with slow data movement and costly replication;
- Heterogeneous and uncooperative database formats;
- Long delays in adding and leveraging new data; and
- Brittle target data models that support existing reporting, but limit addition of new reports and analytics.
But it is possible to ease this suffering and achieve a higher level of data analytics capability, perhaps even nirvana. Herewith, the Eightfold Path to Data Analytics Enlightenment:
- Share the Load. Disrupt legacy IT ecosystems with distributed processing frameworks. The reduced costs and performance gains of distributed clusters are undeniable. The intelligence community was an early adopter of distributed processing and an active contributor to the Apache Hadoop ecosystem. Overloaded data ingestion, ETL, and transformation streams are now being improved with distributed Apache Spark solutions at the Defense Intelligence Agency, the Centers for Medicare & Medicaid Services, and the Department of Homeland Security.
- Jump in the Lake. Exploit emerging Data Lake (NoSQL) architectures in harmony with structured data stores to land and exploit more data, more quickly. Early euphoria around HDFS and NoSQL stores led to misguided proposals to replace data warehouses outright. The best practice is to complement your data warehouse with Data Lakes. This architecture lands data rapidly, and users then explore it with serverless query services and “schema-on-read” tools like Amazon Athena, without building metadata up front. The DHS Neptune Data Framework implements this design pattern to speed time to insight and feed more data stores in classified settings.
- Lease Cloud Services. Leverage the growing collection of cloud Platform-as-a-Service (PaaS) offerings, in addition to cloud Infrastructure-as-a-Service (IaaS), to jump-start your analytics. Many analytic cloud services are elastic, which lowers overall cost when they are not in use. They cover the core components of a traditional ETL-DW-BI solution: managed relational databases, NoSQL repositories, and data warehouses; serverless data preparation and data query; and BI reporting and dashboards. FINRA uses AWS analytic cloud services to ensure the integrity of financial markets and to protect investors. And 17 intelligence agencies use C2S, an AWS cloud platform that provides storage, compute, and elastic analytic services.
- Move up the Analytics Value Chain. Extend your reporting and dashboards with predictive analytics to forecast outcomes, embed analytics in decision-making processes, and apply prescriptive analytics to initiate actions for more desirable outcomes. Federal agencies that distribute large benefits have successfully used both predictive and prescriptive analytics to detect and prevent fraud: IRS prevents income tax fraud, CMS detects Medicare insurance fraud, and USDA SNAP reduces food benefit trafficking. All agencies can benefit from these tools for fraud detection, budget forecasting, financial management, safety, risk assessment, compliance, and scientific research.
- Provision and Empower Everyone. Invest in self-service data preparation, data analysis, and data visualization tools to reduce the time to insight. The days of large, specialized teams of ETL developers, data modelers, data architects, data scientists, and BI developers are dwindling. Agencies need to provision business analysts, executives, data analysts, researchers, and subject matter experts of all abilities. Users need access to data and tools according to their abilities and their need to know. This pervasive use of data resources will result in significant mission impacts. The U.S. Census Center for Applied Technology created such a culture of data exploration and entrepreneurship.
- Curate Your Data. Adopt data governance policies and data management tools to capture, share, and leverage metadata. Support informal exploration of your Data Lake resources with tools that track lineage and manage access, but do not impose structure, because different users may apply a different schema for a different analytical purpose. DHS, law enforcement, and the intelligence community all use rigorous data lineage to ensure data veracity in legal, investigative, and security operations.
- Discover Your Data. Supply your team with discovery tools that leverage statistical analysis, artificial intelligence, data visualization, and geospatial display to find patterns and meaning in data. Exploratory analysis by end users can identify interesting patterns that extend BI reports and dashboards. A new category of automated discovery tools is emerging with the potential to find patterns of value automatically. Extend your ETL-DW-BI platform with these modern BI tools. FAA, VA, DHS, and U.S. courts all use modern BI tools like Tableau, Microsoft Power BI, and QlikView to explore and visualize data.
- Commit to Open Source. Embrace open source tools like Apache Hadoop and Spark that have created disruptive improvements to data analytics. For example, Accumulo provides high-speed access to big data stores. Adopt open source data science languages such as R and Python, along with their packages, and open source data platforms. Become active in the open source community; your agency can even launch its own open source project, as the NSA did with Accumulo.
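As a rough illustration of the “schema-on-read” idea behind the Data Lake and data curation paths above, the sketch below lands raw records untouched and lets two different users apply two different schemas only when they read. It is plain Python, not how Athena or any particular product works; the sample records, field names, and the read_with_schema helper are all hypothetical.

```python
import json

# Hypothetical raw events, landed in the "lake" as-is (JSON lines, no upfront schema).
RAW_LINES = [
    '{"ts": "2017-03-01T10:00:00", "user": "a", "amount": "12.50", "state": "VA"}',
    '{"ts": "2017-03-01T10:05:00", "user": "b", "amount": "7.25"}',
    '{"ts": "2017-03-01T10:09:00", "user": "a", "amount": "3.00", "state": "MD", "device": "mobile"}',
]


def read_with_schema(lines, schema):
    """Apply a schema (field name -> converter) at read time, not at load time.

    Fields missing from a record come back as None; fields not named in the
    schema are simply ignored for this reader.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        }


# Two users, two analytical purposes, two schemas over the same raw data.
finance_schema = {"user": str, "amount": float}
geo_schema = {"user": str, "state": str}

finance_rows = list(read_with_schema(RAW_LINES, finance_schema))
geo_rows = list(read_with_schema(RAW_LINES, geo_schema))

# A simple aggregate using the finance view of the data.
total_by_user = {}
for row in finance_rows:
    total_by_user[row["user"]] = total_by_user.get(row["user"], 0.0) + row["amount"]

print(total_by_user)
print(geo_rows)
```

The raw lines are never rewritten, so a third user could later read the device field without any reload, which is the property that lets a lake complement a more rigid warehouse model.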
These eight paths can ease the data analytics suffering seen in large ETL-DW-BI projects. No project needs to adopt them all to see improvements; adopting even one can benefit a project. Each of these eight paths offers opportunities to improve performance, optimize, innovate, and achieve real mission impacts with data analytics.