As time goes by and technology develops, new words and terms enter circulation, often without anyone knowing who coined them and often (in my experience) without anyone knowing quite what they mean, despite everyone using them. The term ‘data scientist’ is one of these. In the following I will try to add clarity to what is now an in-demand profession, a career path and a discipline, and give suggestions from the literature on how to approach teaching data science.
If data science were a Venn diagram it would be the bit in the middle, at the intersection of ‘computing and data skills’, ‘maths and statistics’ and ‘substantive expertise’1. If we assume ‘substantive expertise’ is domain or background knowledge and park this component here, then data science can be distilled into computing skills and statistics. However, data science is not limited to use by computer scientists and statisticians and it is increasingly apparent that data science skills are necessary for research and employment in our data-centric world.
George Less, Chief Information Officer at Goldman Sachs, believes ‘90% of the world’s data have been created in the last two years’2. The rate-limiting step is no longer acquiring data but finding the people to process it. In accordance with Moore’s law, computing power has grown vastly beyond what it once was. One paper warns ‘the shortage of data scientists is becoming a serious constraint in some sectors’3 and highlights the impact this has on industry looking to capitalise on big data. Hal Varian, the chief economist at Google, described statisticians as ‘the sexy job in the next 10 years’, but twinned this with their ability to write. The McKinsey report highlights the shortage of new graduates with ‘deep analytical skills’4, creating a bottleneck in the pipeline to industry and exposing the ‘need for data savvy citizenry’1.
The American website Glassdoor showed that, across approximately 10,000 data scientist job postings in 2017, skills in Python, R and SQL were mentioned in 74%, 64% and 51% of postings respectively5. These are three of the most popular languages used by data scientists. Beyond a programming language, the consensus from the literature is that data scientists should be exposed to a wealth of topics, including data cleaning or ‘tidying’, visualisation and description, storage and analysis, modelling, design and bias.
However, this demand is top-down and has not yet filtered down far enough to change things at ground level. Data science ‘has no natural home’1 in education. Statistics education is still tied up in approaches designed to ‘circumvent their lack of computational power’1, which is no longer an issue. It is in need of reform. It is recommended that statistics enter the curriculum more frequently and, more importantly, earlier, so it can be built on incrementally1. Although the number of undergraduates completing statistics degrees is growing faster than in any other STEM discipline6, statistics and data science are used in many contemporary disciplines, seldom with formal training7. As such, scientists would benefit from basic data science ‘soft’ skills16, rather than ‘their own bespoke workarounds to keep pace’7. Rather than becoming fully qualified programmers, data scientists are most keen on simple programming that will answer their questions, called ‘scripting’8.
In terms of who should create this curriculum shift, candidates include academia, industry (through its expectations of graduates, for example), statistics associations and funding bodies (having previously funded the high performance computing infrastructure being used to pump out data)9. In terms of who you are learning alongside, a community of practice among data scientists is important for staying up to date and, potentially, motivated. There is extensive community discussion across Twitter, blogs (rbloggers.com, blog.rstudio.org and stackoverflow.com) and, for those in need of a break from the screen, conferences. In a study of over 1000 staff, students and researchers at major biological meetings over 4 years, the National Science Foundation (NSF) found that 94% were (or were due to be) using large data, although 47% rated themselves as ‘beginner’ in terms of bioinformatics skills and 58% felt their institutions did not provide adequate computational resources9. Further, in a survey of 704 biological science PIs in 2016, 90% said they were or would be using big data, and their biggest need was not hardware but training and knowledge of how to use big data effectively9. Combining data from different sources and platforms was identified as one of the biggest issues, yet this is fundamental to using such big or meta-data to address contemporary questions (e.g. genotype, phenotype and environmental interactions)9. With the movement toward open science (and data ‘stewardship’, not ownership), the general public can also provide feedback and support when research outputs are made fully available10.
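To make the point about combining sources concrete, here is a minimal sketch in plain Python. The records, field names and helper functions (`inner_join`, `unmatched_keys`) are all hypothetical and invented for illustration; they do not come from the studies cited. The sketch joins two small sources on a shared sample ID and reports which IDs fail to match, which is often where the trouble starts.

```python
# Hypothetical records from two sources that share a sample ID.
genotypes = [
    {"sample_id": "S1", "genotype": "AA"},
    {"sample_id": "S2", "genotype": "AG"},
    {"sample_id": "S3", "genotype": "GG"},
]
phenotypes = [
    {"sample_id": "S1", "height_cm": 152.0},
    {"sample_id": "S3", "height_cm": 161.5},
    {"sample_id": "S4", "height_cm": 149.0},
]


def inner_join(left, right, key):
    """Return merged records for keys present in both sources."""
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Fields from both records end up in one dictionary.
            merged.append({**row, **match})
    return merged


def unmatched_keys(left, right, key):
    """Return the sorted keys that appear in only one of the two sources."""
    left_keys = {row[key] for row in left}
    right_keys = {row[key] for row in right}
    return sorted(left_keys ^ right_keys)


combined = inner_join(genotypes, phenotypes, "sample_id")
dropped = unmatched_keys(genotypes, phenotypes, "sample_id")
print(combined)
print(dropped)
```

Reporting the unmatched IDs alongside the joined records is a deliberate choice here: silently dropping rows is an easy way to lose data without noticing when sources disagree.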
When launching an ‘Introduction to data science’ course with no pre-requisites at the University of Illinois, the course lead found that the major challenge, given the course’s popularity across disciplines, was getting the balance and motivation right for students arriving with heterogeneous programming backgrounds11. In trying to strike this balance, they noted that the exercises they set were less likely to motivate the more advanced students.
Another major barrier to teaching data science is maintaining interest. This is important as motivated students are more successful learners12 and put in more effort. A recurring solution was to use real-world, relevant and applicable examples. Textbook examples can appear tired, dated and far too naturally-normally-distributed. More contemporary approaches include getting students to write code to identify spam email or model the energy use in their halls of residence13, or highlighting the importance of data science for cancer care14. Going a step beyond this, one author advocates encouraging students to collect their own data, to demonstrate that real datasets don’t always comply with the assumptions of statistical tests and that this has to be dealt with15. Perhaps born out of contemplating the billion-dollar gaming industry, another approach is teaching through gamification. Coding games and competitive programming have been adopted to motivate students and facilitate learning, in an approach similar to hiding veggies in their dinner. Having an interesting problem worth solving is important not only for maintaining interest but also for recruiters trying to appeal to potential employees3,16.
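The point above about real datasets not complying with test assumptions can be shown in a few lines. The sketch below is an illustration only, using simulated data rather than anything from the cited courses: `textbook` stands in for a tidy, normally distributed example and `collected` for a right-skewed dataset of the kind students might gather themselves. The helper `sample_skewness` is a hypothetical name; it computes a simple moment-based skewness, which is near zero for symmetric data and positive when there is a long right tail.

```python
import random
import statistics


def sample_skewness(values):
    """Moment-based skewness: near 0 for symmetric data, > 0 for a long right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    n = len(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / n


# Simulated data (an assumption for illustration, not real measurements).
rng = random.Random(42)
textbook = [rng.gauss(50, 10) for _ in range(5000)]          # roughly normal
collected = [rng.expovariate(1 / 50) for _ in range(5000)]   # right-skewed

skew_textbook = sample_skewness(textbook)
skew_collected = sample_skewness(collected)

print(f"textbook skewness:  {skew_textbook:.2f}")
print(f"collected skewness: {skew_collected:.2f}")
print(f"collected mean vs median: "
      f"{statistics.fmean(collected):.1f} vs {statistics.median(collected):.1f}")
```

A large skewness, or a mean sitting well above the median, is a prompt to reconsider a test that assumes normality, for example by transforming the data or choosing a non-parametric alternative.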
Given the volumes of information out there on the mega-volumes of data, an unsurprising barrier to the use of data science tools is confidence. In an overhaul of their research workflow, one department found that implementing version control, and so being able to revert easily to previous files, reduced apprehension amongst staff about submitting new or imperfect code and helped them make progress7. The vast amount of data out there also highlights a larger issue: finding and securing resources with which to teach data science. The team at the University of Illinois draw their software, references and data almost entirely from freely available online material, but mentioned that this made it hard to keep the course and its level of difficulty consistent11.
Providing feedback and assessment was identified as another issue when trying to teach data science as broadly as possible (e.g. online). Successful approaches to assessment in scaled-up versions of courses have included peer assessment and ‘automated’ machine grading11,13. And lastly, having got the students’ attention, there is an issue that never fails to interrupt us: the inescapable technical problems. Those mentioned by the teaching community include issues with Windows, installing libraries and debugging. Happy troubleshooting!
Acknowledgements
Thanks to Edinburgh University's Dr. Jill Mackay for her support and her advice initiating this review.
Comments
Nice commentary about the problem at a high level and the people needed to analyse the volume of data. It would be interesting to look in more detail at the type of data we record and when we record it, i.e. do we need to be choosy in future about what we record?
Thank you for sharing useful information for all the candidates of Data Science Training who want to kick start this career in this field.