Course Outline: Big Data Business Intelligence for Govt. Agencies Training
Each session is 2 hours
        Day-1: Session -1: Business Overview of Why Big Data Business Intelligence in Govt.
        Case Studies from NIH, DoE
        Big Data adoption rate in Govt. agencies and how they are aligning their future operations around Big Data predictive analytics
        Broad Scale Application Area in DoD, NSA, IRS, USDA etc.
        Interfacing Big Data with Legacy data
        Basic understanding of enabling technologies in predictive analytics
        Data Integration & Dashboard visualization
        Fraud management
        Business Rule/ Fraud detection generation
        Threat detection and profiling
        Cost benefit analysis for Big Data implementation
        Day-1: Session-2 : Introduction of Big Data-1
        Main characteristics of Big Data-volume, variety, velocity and veracity. MPP architecture for volume.
        Data Warehouses – static schema, slowly evolving dataset
        MPP Databases like Greenplum, Exadata, Teradata, Netezza, Vertica etc.
        Hadoop Based Solutions – no conditions on structure of dataset.
        Typical pattern : HDFS, MapReduce (crunch), retrieve from HDFS
        Batch- suited for analytical/non-interactive
        Velocity : CEP streaming data
        Typical choices – CEP products (e.g. InfoSphere Streams, Apama, MarkLogic etc)
        Less production ready – Storm/S4
        NoSQL Databases – (columnar and key-value): Best suited as analytical adjunct to data warehouse/database
        Day-1 : Session -3 : Introduction to Big Data-2
        NoSQL solutions
        KV Store - Keyspace, Flare, SchemaFree, RAMCloud, Oracle NoSQL Database (OnDB)
        KV Store - Dynamo, Voldemort, Dynomite, SubRecord, Mo8onDb, DovetailDB
        KV Store (Hierarchical) - GT.M, Caché
        KV Store (Ordered) - TokyoTyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord
        KV Cache - Memcached, Repcached, Coherence, Infinispan, eXtreme Scale, JBossCache, Velocity, Terracotta
        Tuple Store - Gigaspaces, Coord, Apache River
        Object Database - ZopeDB, db4o, Shoal
        Document Store - CouchDB, Cloudant, Couchbase, MongoDB, Jackrabbit, XML-Databases, ThruDB, CloudKit, Persevere, Riak-Basho, Scalaris
        Wide Columnar Store - BigTable, HBase, Apache Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI
        Varieties of Data: Introduction to Data Cleaning issue in Big Data
        RDBMS – static structure/schema, doesn’t promote agile, exploratory environment.
        NoSQL – semi structured, enough structure to store data without exact schema before storing data
        Data cleaning issues
        Day-1 : Session-4 : Big Data Introduction-3 : Hadoop
        When to select Hadoop?
        STRUCTURED - Enterprise data warehouses/databases can store massive data (at a cost) but impose structure (not good for active exploration)
        SEMI STRUCTURED data – tough to do with traditional solutions (DW/DB)
        Warehousing data = HUGE effort and static even after implementation
        For variety & volume of data, crunched on commodity hardware – HADOOP
        Commodity H/W needed to create a Hadoop Cluster
        Introduction to Map Reduce /HDFS
        MapReduce – distribute computing over multiple servers
        HDFS – make data available locally for the computing process (with redundancy)
        Data – can be unstructured/schema-less (unlike RDBMS)
        Developer responsibility to make sense of data
        Programming MapReduce = working with Java (pros/cons), manually loading data into HDFS
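The map, shuffle and reduce steps named in this session can be sketched in a few lines of plain Python. This is a single-process illustration only, not the Hadoop Java API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run the mappers and reducers on separate nodes against blocks stored in HDFS.

```python
from collections import defaultdict


def map_phase(record):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input lines standing in for a file loaded into HDFS.
sample_lines = ["big data big insight", "data on commodity hardware"]
print(word_count(sample_lines))
```

The mapper and reducer are independent of each other and of the input size, which is what lets a framework spread them over many commodity servers.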
        Day-2: Session-1: Big Data Ecosystem-Building Big Data ETL: universe of Big Data Tools-which one to use and when?
        Hadoop vs. Other NoSQL solutions
        For interactive, random access to data
        HBase (column oriented database) on top of Hadoop
        Random access to data but restrictions imposed (max 1 PB)
        Not good for ad-hoc analytics, good for logging, counting, time-series
        Sqoop - Import from databases to Hive or HDFS (JDBC/ODBC access)
        Flume – Stream data (e.g. log data) into HDFS
        Day-2: Session-2: Big Data Management System
        Moving parts, compute nodes start/fail :ZooKeeper - For configuration/coordination/naming services
        Complex pipeline/workflow: Oozie – manage workflow, dependencies, daisy chain
        Deploy, configure, cluster management, upgrade etc (sys admin) :Ambari
        In Cloud : Whirr
        Day-2: Session-3: Predictive analytics in Business Intelligence -1: Fundamental Techniques & Machine learning based BI :
        Introduction to Machine learning
        Learning classification techniques
        Bayesian Prediction-preparing training file
        Support Vector Machine
        KNN p-Tree Algebra & vertical mining
        Neural Network
        Big Data large variable problem -Random forest (RF)
        Big Data Automation problem – Multi-model ensemble RF
        Automation through Soft10-M
        Text analytic tool-Treeminer
        Agile learning
        Agent based learning
        Distributed learning
        Introduction to Open source Tools for predictive analytics : R, RapidMiner, Mahout
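As a concrete reference point for the classification techniques listed in this session, here is a minimal sketch of a brute-force k-nearest-neighbour (KNN) classifier. It assumes small in-memory data and Euclidean distance; it is not the p-Tree / vertical-mining variant named above, and the function name and sample data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to `query`.

    `train` is a list of (feature_tuple, label) pairs.
    """
    # Sort all training points by Euclidean distance to the query (brute force).
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Let the k closest points vote on the label.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training set with two well-separated classes.
training_data = [
    ((1.0, 1.0), "normal"),
    ((1.2, 0.8), "normal"),
    ((0.9, 1.1), "normal"),
    ((5.0, 5.2), "suspect"),
    ((5.1, 4.9), "suspect"),
    ((4.8, 5.0), "suspect"),
]
print(knn_predict(training_data, (1.1, 1.0)))
print(knn_predict(training_data, (5.0, 5.0)))
```

Brute-force KNN scans every training point per query, which is one reason large datasets call for indexed or distributed variants.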
        Day-2: Session-4 Predictive analytics eco-system-2: Common predictive analytic problems in Govt.
        Insight analytic
        Visualization analytic
        Structured predictive analytic
        Unstructured predictive analytic
        Threat/fraudster/vendor profiling
        Recommendation Engine
        Pattern detection
        Rule/Scenario discovery –failure, fraud, optimization
        Root cause discovery
        Sentiment analysis
        CRM analytic
        Network analytic
        Text Analytics
        Technology assisted review
        Fraud analytic
        Real Time Analytic
        Day-3 : Session-1 : Real Time and Scalable Analytic Over Hadoop
        Why common analytic algorithms fail in Hadoop/HDFS
        Apache Hama- for Bulk Synchronous distributed computing
        Apache SPARK- for cluster computing for real time analytic
        CMU GraphLab 2 - Graph based asynchronous approach to distributed computing
        KNN p-Algebra based approach from Treeminer for reduced hardware cost of operation
        Day-3: Session-2: Tools for eDiscovery and Forensics
        eDiscovery over Big Data vs. Legacy data – a comparison of cost and performance
        Predictive coding and technology assisted review (TAR)
        Live demo of a TAR product (vMiner) to understand how TAR works for faster discovery
        Faster indexing through HDFS –velocity of data
        NLP or Natural Language processing –various techniques and open source products
        eDiscovery in foreign languages-technology for foreign language processing
        Day-3 : Session 3: Big Data BI for Cyber Security – Understanding the full 360-degree view from rapid data collection to threat identification
        Understanding basics of security analytics-attack surface, security misconfiguration, host defenses
        Network infrastructure/ Large datapipe / Response ETL for real time analytic
        Prescriptive vs predictive – Fixed rule based vs auto-discovery of threat rules from Meta data
        Day-3: Session 4: Big Data in USDA : Application in Agriculture
        Introduction to IoT ( Internet of Things) for agriculture-sensor based Big Data and control
        Introduction to Satellite imaging and its application in agriculture
        Integrating sensor and image data for fertility of soil, cultivation recommendation and forecasting
        Agriculture insurance and Big Data
        Crop Loss forecasting
        Day-4 : Session-1: Fraud prevention BI from Big Data in Govt-Fraud analytic:
        Basic classification of Fraud analytics- rule based vs predictive analytics
        Supervised vs unsupervised Machine learning for Fraud pattern detection
        Vendor fraud/over charging for projects
        Medicare and Medicaid fraud- fraud detection techniques for claim processing
        Travel reimbursement frauds
        IRS refund frauds
        Case studies and live demo will be given wherever data is available.
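The contrast between rule-based and unsupervised fraud detection drawn in this session can be illustrated with a small sketch. It assumes each claim is a dict with hypothetical `id` and `amount` fields; the fixed limit stands in for a hand-written business rule, and a simple z-score test stands in for an unsupervised method. Production systems would use far richer features and models.

```python
from statistics import mean, pstdev


def rule_based_flags(claims, limit):
    """Rule-based: flag every claim whose amount exceeds a fixed, hand-written limit."""
    return [c["id"] for c in claims if c["amount"] > limit]


def zscore_flags(claims, threshold=2.0):
    """Unsupervised: flag claims more than `threshold` standard deviations from the mean."""
    amounts = [c["amount"] for c in claims]
    mu, sigma = mean(amounts), pstdev(amounts)
    if sigma == 0:
        return []  # all amounts identical, nothing stands out
    return [c["id"] for c in claims if abs(c["amount"] - mu) / sigma > threshold]


# Hypothetical reimbursement claims; claim 9 is an obvious outlier.
sample_amounts = [100, 110, 95, 105, 90, 100, 102, 98, 1000]
sample_claims = [{"id": i, "amount": a} for i, a in enumerate(sample_amounts, start=1)]

print(rule_based_flags(sample_claims, limit=105))
print(zscore_flags(sample_claims))
```

The rule needs someone to choose the limit in advance, while the z-score test derives what counts as unusual from the data itself.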
        Day-4 : Session-2: Social Media Analytic- Intelligence gathering and analysis
        Big Data ETL API for extracting social media data
        Text, image, meta data and video
        Sentiment analysis from social media feed
        Contextual and non-contextual filtering of social media feed
        Social Media Dashboard to integrate diverse social media
        Automated profiling of social media profile
        Live demo of each analytic will be given through Treeminer Tool.
        Day-4 : Session-3: Big Data Analytic in image processing and video feeds
        Image Storage techniques in Big Data- Storage solution for data exceeding petabytes
        LTFS and LTO
        GPFS-LTFS ( Layered storage solution for Big image data)
        Fundamental of image analytics
        Object recognition
        Image segmentation
        Motion tracking
        3-D image reconstruction
        Day-4: Session-4: Big Data applications in NIH:
        Emerging areas of Bio-informatics
        Meta-genomics and Big Data mining issues
        Big Data Predictive analytic for Pharmacogenomics, Metabolomics and Proteomics
        Big Data in downstream Genomics process
        Application of Big data predictive analytics in Public health
        Big Data Dashboard for quick accessibility of diverse data and display :
        Integration of existing application platform with Big Data Dashboard
        Big Data management
        Case Study of Big Data Dashboard: Tableau and Pentaho
        Use Big Data app to push location based services in Govt.
        Tracking system and management
        Day-5 : Session-1: How to justify Big Data BI implementation within an organization:
        Defining ROI for Big Data implementation
        Case studies of saving analyst time in data collection and preparation, and the resulting productivity gain
        Case studies of revenue gain from saving the licensed database cost
        Revenue gain from location based services
        Saving from fraud prevention
        An integrated spreadsheet approach to calculate approx. expense vs. Revenue gain/savings from Big Data implementation.
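The expense-versus-savings comparison described in this session reduces to a short calculation. The sketch below assumes annual figures grouped into cost and gain line items; the function name, line-item names and amounts are hypothetical placeholders, not figures from the course.

```python
def roi(costs, gains):
    """Return (net_benefit, roi_ratio) from dicts of annual cost and gain line items.

    roi_ratio = (total gains - total costs) / total costs
    """
    total_cost = sum(costs.values())
    total_gain = sum(gains.values())
    if total_cost == 0:
        raise ValueError("total cost must be non-zero")
    net = total_gain - total_cost
    return net, net / total_cost


# Hypothetical line items mirroring the categories in this session.
annual_costs = {"hardware": 200_000, "staff": 300_000}
annual_gains = {
    "analyst_time_saved": 250_000,
    "database_license_savings": 150_000,
    "fraud_prevented": 350_000,
}

net_benefit, ratio = roi(annual_costs, annual_gains)
print(f"Net benefit: {net_benefit}, ROI: {ratio:.0%}")
```

Keeping each saving as its own line item makes it easy to see which assumption drives the result when the figures are challenged.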
        Day-5 : Session-2: Step by Step procedure to replace legacy data system to Big Data System:
        Understanding practical Big Data Migration Roadmap
        What information is needed before architecting a Big Data implementation
        What are the different ways of calculating volume, velocity, variety and veracity of data
        How to estimate data growth
        Case studies
        Day-5: Session 4: Review of Big Data Vendors and review of their products. Q/A session:
        Accenture
        APTEAN (Formerly CDC Software)
        Cisco Systems
        Cloudera
        Dell
        EMC
        GoodData Corporation
        Guavus
        Hitachi Data Systems
        Hortonworks
        HP
        IBM
        Informatica
        Intel
        Jaspersoft
        Microsoft
        MongoDB (Formerly 10Gen)
        Mu Sigma
        NetApp
        Opera Solutions
        Oracle
        Pentaho
        Platfora
        QlikTech
        Quantum
        Rackspace
        Revolution Analytics
        Salesforce
        SAP
        SAS Institute
        Sisense
        Software AG/Terracotta
        Soft10 Automation
        Splunk
        Sqrrl
        Supermicro
        Tableau Software
        Teradata
        Think Big Analytics
        Tidemark Systems
        Treeminer
        VMware (Part of EMC)
 
     
     
         