    Course outline: Data Science for Big Data Analytics Training
Introduction to Data Science for Big Data Analytics
        Data Science Overview
        Big Data Overview
        Data Structures
        Drivers and complexities of Big Data
        Big Data ecosystem and a new approach to analytics
        Key technologies in Big Data
        Data Mining process and problems
        Association Pattern Mining
        Data Clustering
        Outlier Detection
        Data Classification
        Introduction to Data Analytics lifecycle
        Discovery
        Data preparation
        Model planning
        Model building
        Presentation/Communication of results
        Operationalization
        Exercise: Case study
        From this point on, most of the training time (80%) will be spent on examples and exercises in R and related big data technologies.
        Getting started with R
        Installing R and RStudio
        Features of R language
        Objects in R
        Data in R
        Data manipulation
        Big data issues
        Exercises
        Getting started with Hadoop
        Installing Hadoop
        Understanding Hadoop modes
        HDFS
        MapReduce architecture
        Overview of Hadoop-related projects
        Writing programs in Hadoop MapReduce
        Exercises
        Integrating R and Hadoop with RHadoop
        Components of RHadoop
        Installing RHadoop and connecting with Hadoop
        The architecture of RHadoop
        Hadoop streaming with R
        Data analytics problem solving with RHadoop
        Exercises
        Pre-processing and preparing data
        Data preparation steps
        Feature extraction
        Data cleaning
        Data integration and transformation
        Data reduction – sampling, feature subset selection, dimensionality reduction
        Discretization and binning
        Exercises and Case study
        Exploratory data analysis methods in R
        Descriptive statistics
        Exploratory data analysis
        Visualization – preliminary steps
        Visualizing single variable
        Examining multiple variables
        Statistical methods for evaluation
        Hypothesis testing
        Exercises and Case study
        Data Visualizations
        Basic visualizations in R
        Packages for data visualization: ggplot2, lattice, plotly
        Formatting plots in R
        Advanced graphs
        Exercises
        Regression (Estimating future values)
        Linear regression
        Use cases
        Model description
        Diagnostics
        Problems with linear regression
        Shrinkage methods, ridge regression, the lasso
        Generalizations and nonlinearity
        Regression splines
        Local polynomial regression
        Generalized additive models
        Regression with RHadoop
        Exercises and Case study
        Classification
        Classification-related problems
        Bayesian refresher
        Naïve Bayes
        Logistic regression
        K-nearest neighbors
        Decision tree algorithms
        Neural networks
        Support vector machines
        Diagnostics of classifiers
        Comparison of classification methods
        Scalable classification algorithms
        Exercises and Case study
        Assessing model performance and selection
        Bias, Variance and model complexity
        Accuracy vs Interpretability
        Evaluating classifiers
        Measures of model/algorithm performance
        Hold-out method of validation
        Cross-validation
        Tuning machine learning algorithms with caret package
        Visualizing model performance with Profit, ROC, and Lift curves
        Ensemble Methods
        Bagging
        Random Forests
        Boosting
        Gradient boosting
        Exercises and Case study
        Support vector machines for classification and regression
        Maximal Margin classifiers
        Support vector classifiers
        Support vector machines
        SVMs for classification problems
        SVMs for regression problems
        Exercises and Case study
        Identifying unknown groupings within a data set
        Feature Selection for Clustering
        Representative-based algorithms: k-means, k-medoids
        Hierarchical algorithms: agglomerative and divisive methods
        Probabilistic model-based algorithms: EM
        Density-based algorithms: DBSCAN, DENCLUE
        Cluster validation
        Advanced clustering concepts
        Clustering with RHadoop
        Exercises and Case study
        Discovering connections with Link Analysis
        Link analysis concepts
        Metrics for analyzing networks
        The PageRank algorithm
        Hyperlink-Induced Topic Search (HITS)
        Link Prediction
        Exercises and Case study
        Association Pattern Mining
        Frequent Pattern Mining Model
        Scalability issues in frequent pattern mining
        Brute Force algorithms
        Apriori algorithm
        The FP growth approach
        Evaluation of Candidate Rules
        Applications of Association Rules
        Validation and Testing
        Diagnostics
        Association rules with R and Hadoop
        Exercises and Case study
        Constructing recommendation engines
        Understanding recommender systems
        Data mining techniques used in recommender systems
        Recommender systems with recommenderlab package
        Evaluating the recommender systems
        Recommendations with RHadoop
        Exercise: Building recommendation engine
        Text analysis
        Text analysis steps
        Collecting raw text
        Bag of words
        Term Frequency–Inverse Document Frequency (TF-IDF)
        Determining Sentiments
        Exercises and Case study
 
     
     
         