MIS772 Predictive Analytics (2019 T1)
Individual Assignment A2-LP4 / All Workshops
Assignment A2: Text Mining + DT + Neural Nets + Optimisation

Student Name (as per record):
Student No: Student number

My other group members
A2 Group No: As per CloudDeakin group number
Student Name (as per record):
Student Nos: Student number, Student number, Student number

Exceptional | Meets expectations | Issues noted | Improve | Unacceptable
Exec Report
Create Models
Evaluate & Improve
Provide Solution
Research & Extend

Brief Comments

Read these notes as we are really trying to help you out!

Remember: If it is not in this report, it does not exist and does not get marked!

You can use the above form to estimate the expected mark against the rubric (see the assignment “info” document). Be realistic and note that we will find many problems you may not be aware of.

Assume that markers may be tired when assessing your work and they may miss some important aspects of your submission when not presented clearly, or when you deviate from the structure of this template, or if you do not include them in your report. So be clear, number all tables, charts and screen shots used as evidence, describe all visuals, cross-reference your analysis with evidence.

Submit this report in PDF format to avoid accidental reformatting of the content.

Submit all RapidMiner processes (.RMP files) in a separate ZIP archive, so that if there is any doubt we could load your work and replicate your results (we will not do this to find missing report parts).

Ensure that the report is readable and the font is no smaller than Arial 10 points.

In the report include only the most significant results for your analysis and recommendations.

You will be able to submit your work once only so make sure you get it right – check these before posting on CloudDeakin:
Is this your document?
Is this the correct unit, assignment, year and trimester?
Is your name entered above?
Is the group number included and is it correct?
Are names of your group members entered as well?
Are all pages included?
Does it all fit into the required page limit?
Have you zipped all RapidMiner files (.RMP files)?
Are the report contents yours alone?

Then, after the submission, check these:
Has the PDF report been submitted?
Has the ZIP archive of RMP files been submitted?
Can you retrieve and reopen both from your submission folder?

Note that the late penalty will be calculated on the date and time of the last submitted file.

Finally, as all reports will be inspected for plagiarism, ensure that your analysis, your evidence, your way of thinking, your report and its presentation are unique and demonstrate your ability to create it all independently. So if you work in a team, compare your submission to those of your team members and make it quite distinct in both content and form. Any part of this report that bears any resemblance to another student's report, or to any information source written by others or by you for another unit (e.g. on the web), will be treated as plagiarism.

Total
Include: Report and RMP files, with clear comments supplied to (easily) reproduce reported results.

Executive summary (one page)

Expectation
Business Problem: Restate / redefine a business problem from the previous stage LP3.
Solution to Business Problem: Succinctly describe the solution and justify it. Provide references to the supporting evidence, e.g. charts and plots from the following sections.

Extension
Narrow down the business problem. Identify decisions and actions that will be supported by the analytic solution. Include a list of the academic references used.

Hints
For clarity and to guide the markers, leave the words “Expectation” and “Extension”.
Do not attempt the extension unless the main objective has been achieved.
If not attempting the extension then delete the extension section, together with the word “Extension”.
Use this font for your report, i.e. Arial 10 (and not MyTiniestFont 2).
Make sure your exec summary is very clear.
You can restate or rephrase the problem statement from the previous assignment stage.
Do not invent your own problem: it has been given to you but may not be achievable in its current form.
Ensure that whatever problem you describe can be solved using the provided data.
Make sure the exec summary describes the problem from the business perspective.
Do not use technical or scientific terms and concepts as the managers may not understand them.
Use business language and not computer / mathematical / statistical / data science language.
The problem statement should describe the high-level aims and not the methods for achieving them.
Think about and state the likely benefit of this project for the company and its management.
Think about and state who the company's clients are and what the likely benefits of this project are for them.
Ensure that your solution clearly matches the problem statement.
Ensure that the solution is formulated in terms of achieving the high-level aim.
Do not include any charts or tables in this section, just cross-references to other sections. So, use appropriate cross-references between your problem/solution statement and tables or charts from the following sections, e.g. you can refer to them as “… (see Figure 1)” or “As shown in Table 4…”.
If you need to support your statements / analysis / argument with references to any published materials, use Harvard citation style as described in: http://www.deakin.edu.au/students/studying/study-support/referencing/harvard.
As the executive summary should take less than a full page, we suggest including your bibliographic references at the bottom of this page, immediately below the executive summary (or problem description).
All comments, such as this, which are not part of your submission should be deleted to save space.

Create a Model(s) in RapidMiner (two pages / page 1)

Expectation
Create at least these two models, i.e.
(M1) decision trees (or GBTs / Random Forests) and (M2) neural nets.
Ensure you consider three types of solutions, which are based on (A1) structured data only, (A2) text data only, (A3) a mix of structured and text data.
Describe and justify the operators' parameters.

Extension
Analyse and eliminate anomalies. Rely on your data clustering. Use PCA to visualise clusters and anomalies.
Create and use a composite model ensemble (which has its own structure, e.g. voting, bagging, stacking, or another which is not provided as an “atomic” operator).
Answer the management question (B).

Hints
Include here screenshots of all or parts of the RM process.
Describe and justify the most important operator parameters.
If analysis or results could only be determined by inspecting the process or running it, the marks will be reduced.
For clarity and to guide the markers, leave the words “Expectation” and “Extension”.
Do not attempt the extension unless the main objective has been achieved.
If not attempting the extension then delete the extension section, together with the word “Extension”.
Use this font for your report, i.e. Arial 10 (and not MyTiniestFont 2).

Create a Model(s) in RapidMiner (two pages / page 2)

This is a continuation of the section reported in the previous page.
All comments, such as this, which are not part of your submission can be deleted to save space.

Evaluate and Improve the Model(s) in RapidMiner (two pages / page 1)

Expectation
Use cross-validation, e.g. 5-fold (note that training of neural networks may take a long time).
Optimise the models to minimise overall error in ratings.
Tabulate the performance of all models (including ensembles if used), using R², correlation and other measures.
Identify the best performing model and justify its selection.

Extension
Use systematic grid optimisation of the models' hyper-parameters.
Visualise the results of grid optimisation.
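The cross-validated evaluation and grid optimisation described above can be illustrated outside RapidMiner. The following is a minimal Python sketch, not the RapidMiner process expected for this section: it assumes synthetic data, a hypothetical one-feature ridge regression standing in for the real models, and illustrative helper names (k_fold_indices, fit_ridge_line, cross_validate). For each value in a small penalty grid it runs 5-fold cross-validation and tabulates the mean RMSE, R2 and correlation. Scripts of this kind belong only in the Further Research section; all expected work must still be done in RapidMiner.

```python
import math
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle row indices once and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_ridge_line(xs, ys, penalty):
    """Fit y = a + b*x; the ridge penalty shrinks the slope b (0 = least squares)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / (sxx + penalty)
    return my - slope * mx, slope


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    m = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - m) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def pearson(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    ma, mp = mean(actual), mean(predicted)
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    va = sum((a - ma) ** 2 for a in actual)
    vp = sum((p - mp) ** 2 for p in predicted)
    return cov / math.sqrt(va * vp)


def cross_validate(xs, ys, penalty, k=5):
    """Return (mean RMSE, mean R2, mean correlation) over k held-out folds."""
    rows = []
    for held_out in k_fold_indices(len(xs), k):
        test = set(held_out)
        train = [i for i in range(len(xs)) if i not in test]
        a, b = fit_ridge_line([xs[i] for i in train], [ys[i] for i in train], penalty)
        actual = [ys[i] for i in held_out]
        predicted = [a + b * xs[i] for i in held_out]
        rmse = math.sqrt(mean((y - p) ** 2 for y, p in zip(actual, predicted)))
        rows.append((rmse, r_squared(actual, predicted), pearson(actual, predicted)))
    return tuple(mean(col) for col in zip(*rows))


# Synthetic stand-in data (an assumption, not the assignment data set).
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [1.0 + 0.4 * x + rng.gauss(0, 0.5) for x in xs]

# Grid optimisation: evaluate every candidate penalty with the same 5 folds.
grid = [0.0, 10.0, 100.0, 1000.0]
results = {lam: cross_validate(xs, ys, lam) for lam in grid}
best = min(results, key=lambda lam: results[lam][0])

print(f"{'penalty':>8} {'RMSE':>7} {'R2':>7} {'corr':>7}")
for lam in grid:
    rmse, r2, corr = results[lam]
    print(f"{lam:>8.1f} {rmse:>7.3f} {r2:>7.3f} {corr:>7.3f}")
print("lowest mean RMSE at penalty", best)
```

Every grid value is scored on the same folds, so differences between rows reflect the hyper-parameter rather than the data split. The printed table has the same shape as the performance table requested in this section, with one row per candidate setting.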
Hints
Include here the text of your analysis with tables and charts, and screenshots of parts of the RM process.
If you experimented with different model parameters, describe the results of these experiments.
If analysis or results could only be determined by inspecting the process or running it, the marks will be reduced.
For clarity and to guide the markers, leave the words “Expectation” and “Extension”.
Do not attempt the extension unless the main objective has been achieved.
If not attempting the extension then delete the extension section, together with the word “Extension”.
Use this font for your report, i.e. Arial 10 (and not MyTiniestFont 2).

Evaluate and Improve the Model(s) in RapidMiner (two pages / page 2)

This is a continuation of the section reported in the previous page.
All comments, such as this, which are not part of your submission can be deleted to save space.

Provide an Integrated Solution in RapidMiner (one page)

Expectation
Create a quality deployment process using your best model with the optimum parameters.
Score the new data with the selected model and discuss the results.
Explain how to prepare data and use the results in practice, e.g. show what data the user received, how the data was prepared, how the model was applied to the data, what results were obtained and what conclusions the user would draw from these reports.

Extension
Ensure all pre-processing and predictive models, as well as word lists, weights and PCA models, are saved during optimisation and then retrieved and applied in deployment.

Hints
Include here the text of your analysis with tables and charts, and screenshots of parts of the RM process.
For the extension, make sure that you save your pre-processing data models, your predictive models, word lists and variable selections as they were developed in the process of training, validation and optimisation, so that you can then read and reapply these pre-processing steps and models to completely new data in production.
Note that the data used in practice would not include the label attribute, as this is what would be missing and only your system could predict or estimate these values.
It is also up to you to decide whether you require the data to be cleaned up in any way by those preparing the data for scoring.
If analysis or results could only be determined by inspecting the process or running it, the marks will be reduced.
For clarity and to guide the markers, leave the words “Expectation” and “Extension”.
Do not attempt the extension unless the main objective has been achieved.
If not attempting the extension then delete the extension section, together with the word “Extension”.
Use this font for your report, i.e. Arial 10 (and not MyTiniestFont 2).

Further Research and Extensions in RM (one page)

Expectation
When working in analytics you always need to enhance your skills by self-study. Extend your work with RM features beyond what was covered in class, to improve the model, use novel visualisations, or analyse results in the most effective way.
Alternatively, conduct independent research to assess and contextualise your results. For example, you may wish to perform some additional data analysis using RapidMiner, or using some other tools such as R or Python extensions or some other analytics / data mining tools. However, ensure that all the expected work in the above sections is done using RapidMiner. If you decide to use other tools for this section only, your submission should also include their project scripts (e.g. include Python or R scripts in a ZIP archive together with the RMP files). Please do not use Excel, which is not a data mining tool!
Alternatively, conduct independent research in the area related to the analysed data set to determine if your predictions are able to confirm or extend previously published results.

Extension
Surprise us with your insights.
Be sure to use a few academic references in this section.
However, due to the shortage of space, you may wish to include the list of references in the executive summary section (above).

Hints
Include here the text of your analysis with tables and charts, and screenshots of parts of the RM process.
If parts of your research included R or Python, attach the code as part of your submission's ZIP archive.
If analysis or results could only be determined by inspecting the process or running it, the marks will be reduced.
For clarity and to guide the markers, leave the words “Expectation” and “Extension”.
Do not attempt the extension unless the main objective has been achieved.
If not attempting the extension then delete the extension section, together with the word “Extension”.
Use this font for your report, i.e. Arial 10 (and not MyTiniestFont 2).

Any materials, analysis or reports that do not fit into 8 pages (eight pages in total, including the front page) will not be looked at or marked.