Do you want to learn and apply ML algorithms to real data sets? Get your hands dirty cleaning datasets and producing insightful graphics!
Warwick and UCL's data science societies are announcing a collaboration for this summer's research scheme!
Please refer to the repository here.
✅ Build valuable research project experience that you can use for future job/university applications
✅ Boost your CV
✅ Learn a new programming language like Python, R or Julia
✅ Learn a new statistical model and machine learning algorithm
✅ Work closely with experienced data scientists, graduates and PhD students
The scheme is open to any university student interested in cleaning datasets, producing graphics, learning about statistical models, and applying ML algorithms to real datasets.
🏳️ There are no requirements regarding your subject of study - data science skills are valuable tools to master in plenty of work areas!
☢️ Check the complexity level of each project! It follows traffic light colors 🔴 🟡 🟢
The scheme is run by students for students. It is intended as an informal way to gain the kind of research experience you may encounter in your dissertation or in other work, and it can prove useful for your MSc or job applications. We may plan a social event 💃 at Warwick or UCL (or both!) for the scheme's participants to chat about the projects, results and challenges over a glass of cold beer 🍺
As one of WDSS & UCLDSS's summer researchers, you will be immersed in either group or independent work for one of the projects. We offer access to resources and support from experienced data scientists and subject ambassadors with specific domain expertise.
- run virtually :computer:
- the projects are done in small teams, unless you wish otherwise
- weekly meetings with a supervisor 🦸: a chance to seek guidance and suggestions, catch up with things, show your work and share your thoughts
- produce a final report (with code to reproduce results) which will be published on Warwick and UCL blogs
- there will be the possibility to present your work to YRM 🧑🏫, an informal seminar that gathers young researchers in statistics (no professors involved). This would be entirely optional, but a good opportunity to develop presenting skills in a friendly environment and showcase your work.
💻 Some knowledge of either R or Python.
📚 Basic knowledge of linear regression, algebra, hypothesis testing, exponential distributions.
Here you can find a list of the proposed projects. Feel free to suggest your own research question or get in touch to formulate one that suits your preferences!
Text analysis is the process of using computer systems to read and understand text. Text can be in any shape and form - emails, tweets, social media comments, marketing copy, customer support tickets, survey responses.
Many tasks can be performed on text. One of them is sentiment analysis, which extracts insights into emotions, moods or opinions. For example, a favorable review often contains words like good, fast, and great, whereas a negative review might contain words like unhappy, slow, and bad.
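As a rough illustration of the idea, the sketch below scores a review by counting positive and negative words. It is a toy, not a real sentiment model: the two word sets only reuse the example words from the paragraph above, and the function names `sentiment_score` and `classify` are made up for this example. Libraries such as NLTK ship much larger lexicons and handle negation, which this sketch ignores.

```python
# Toy lexicon-based sentiment scorer (illustrative sketch only).
# The word lists are the example words from the text, not a real lexicon.
import re

POSITIVE_WORDS = {"good", "fast", "great"}
NEGATIVE_WORDS = {"unhappy", "slow", "bad"}


def sentiment_score(text: str) -> int:
    """Return (number of positive words) minus (number of negative words)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE_WORDS for token in tokens)
    negatives = sum(token in NEGATIVE_WORDS for token in tokens)
    return positives - negatives


def classify(text: str) -> str:
    """Map the score to a coarse label; a score of zero counts as neutral."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    reviews = [
        "Great product, fast delivery and good support.",
        "Slow shipping and bad packaging, I am unhappy.",
        "The parcel arrived on Tuesday.",
    ]
    for review in reviews:
        print(f"{classify(review):>8}: {review}")
```

A word-counting approach like this is easy to inspect, which makes it a reasonable baseline before moving on to trained classifiers.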
Useful links:
- Introduction to Sentiment Analysis Using Python NLTK Library
- Video: Sentiment Analysis Using NLTK
- Text mining with R
Python is one of the most widely used programming languages today. Because so much of its functionality lives in libraries, it is vital to gain familiarity with the most popular and requested ones, including TensorFlow, NumPy, SciPy, pandas, Matplotlib, Keras, scikit-learn, PyTorch, Scrapy and BeautifulSoup.
Useful links:
Artificial neural networks are forecasting methods that are based on simple mathematical models of the brain. They allow complex nonlinear relationships between the response variable and its predictors.
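To make the "nonlinear relationships" point concrete, here is a minimal sketch of a network with one hidden layer of two ReLU units. The weights are hand-picked for illustration (nothing is trained here) so that the network reproduces XOR, a relationship that no purely linear model of the two inputs can represent. The function names are hypothetical.

```python
# A tiny feed-forward network with hand-picked (not learned) weights.
# Two ReLU hidden units are enough to represent XOR on binary inputs.

def relu(value: float) -> float:
    """Rectified linear unit: the nonlinearity applied in the hidden layer."""
    return max(0.0, value)


def tiny_network(x1: float, x2: float) -> float:
    # Hidden layer: each unit is a weighted sum of the inputs plus a bias,
    # passed through the nonlinearity.
    h1 = relu(1.0 * x1 + 1.0 * x2 + 0.0)
    h2 = relu(1.0 * x1 + 1.0 * x2 - 1.0)
    # Output layer: a plain weighted sum of the hidden units.
    return 1.0 * h1 - 2.0 * h2


if __name__ == "__main__":
    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x1, x2, "->", tiny_network(x1, x2))
```

In a real project the weights would be learned from data, typically with a library such as PyTorch or Keras, rather than set by hand.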
One application area is the prediction of solar radiation, which is crucial in both the solar industry and climate research. For example, forecasting the output power of solar systems is required for the good operation of the power grid or for the optimal management of the energy fluxes in the solar system.
Useful links:
- Solar Radiation Prediction Using Different Machine Learning Algorithms and Implications for Extreme Climate Events
- Machine learning methods for solar radiation forecasting: A review
The term spatial statistics refers to the application of statistical concepts and methods to data that have a spatial location attached to them, and in which this locational element is used as an important and necessary part of the analysis.
Its applications include exploring geographical determinants relevant for evaluating health intervention programmes or disaster management policies (e.g., spatial inequalities within neighbourhoods, such as poor road conditions, that have an impact on child mortality or higher education outcomes). Another area of application is exploring spatio-temporal dependencies to model climate data. Models that can be used include spatial autoregressive models, GLMs, and Bayesian hierarchical models.
Useful links:
- An Introduction to Spatial Analysis and Mapping in R
- An Introduction to Spatial Data Analysis and Visualisation in R
- Using spatial analysis and GIS to improve planning and resource allocation in a rural district of Bangladesh
- Video: Spatial Statistics in R: An Introductory Tutorial with Examples
There exist numerous ML/DL algorithms that deal with images as input variables and perform different tasks, from reconstruction to detection, classification, etc.
Application areas include the analysis of medical or satellite images.
Useful links:
- A survey on deep learning in medical image reconstruction
- Medical image analysis
- Deep Learning for Understanding Satellite Imagery: An Experimental Survey
Bayesian methods are becoming increasingly popular in sports analytics. Identified advantages of the Bayesian approach include the ability to model complex problems, obtain probabilistic estimates and predictions that account for uncertainty, combine information sources and update learning as new data become available. The volume and variety of data produced in sports activities over recent years and the availability of software packages for Bayesian computation have contributed significantly to this growth.
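The idea of updating beliefs as new data become available can be sketched with the simplest conjugate model: a Beta prior on a team's win probability, updated with observed wins and losses. This is only an illustrative assumption about how such a model might start, not the approach of any particular paper; the class name `BetaBelief` and the match counts are invented for the example.

```python
# Beta-Binomial updating of a team's win probability (illustrative sketch).
# The prior and the match results below are made-up numbers.
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaBelief:
    """Beta(alpha, beta) distribution over an unknown win probability."""
    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        """Posterior mean estimate of the win probability."""
        return self.alpha / (self.alpha + self.beta)

    def update(self, wins: int, losses: int) -> "BetaBelief":
        """Conjugate update: add wins to alpha and losses to beta."""
        return BetaBelief(self.alpha + wins, self.beta + losses)


if __name__ == "__main__":
    belief = BetaBelief(1.0, 1.0)               # uniform prior: no information
    belief = belief.update(wins=3, losses=1)    # first batch of matches
    print(f"after 4 matches:  {belief.mean:.3f}")
    belief = belief.update(wins=2, losses=4)    # later batch of matches
    print(f"after 10 matches: {belief.mean:.3f}")
```

Because the posterior after one batch serves as the prior for the next, the estimate can be refreshed after every match without refitting from scratch.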
Useful links:
- A machine learning framework for sport result prediction
- The Relative Importance of Ability, Luck and Motivation in Team Sports: a Bayesian Model of Performance in Rugby
Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximise the notion of cumulative reward.
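A minimal sketch of this loop is tabular Q-learning on a toy environment. Everything below is an assumption made for illustration: the five-state "chain" environment, the reward of 1 for reaching the last state, and the hyperparameters are invented, and the agent explores with uniformly random actions (Q-learning is off-policy, so it can still learn the greedy policy).

```python
# Tabular Q-learning on a made-up 5-state chain (illustrative sketch).
import random

N_STATES = 5            # states 0..4; reaching state 4 ends the episode
ACTIONS = (0, 1)        # 0 = move left, 1 = move right
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor (arbitrary choices)


def step(state, action):
    """Deterministic transition; moving left from state 0 stays in state 0."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, seed=0):
    """Learn action values from episodes generated by random exploration."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # Bootstrapped target: reward plus discounted best future value.
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Best action in each non-terminal state according to the learned values."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]


if __name__ == "__main__":
    print(greedy_policy(train()))
```

The learned greedy policy moves right in every state, which is the shortest route to the reward and therefore the one that maximises the discounted cumulative reward.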
One area of application is algorithmic trading strategies.
Useful links:
Traffic prediction is the task of forecasting real-time traffic information based on floating car data and historical traffic data, such as traffic flow, average traffic speed and traffic incidents. Its uses include managing vehicle movement, reducing congestion, and generating the optimal route. Existing ML approaches include the random forest algorithm, which builds multiple decision trees and merges their predictions to obtain more accurate results, and the k-nearest neighbors (KNN) algorithm, which relies on the principle of feature similarity to predict future values.
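The feature-similarity idea behind KNN can be sketched in a few lines: describe each moment by the last few flow readings, find the historical moments that looked most alike, and average what happened next. The flow values below are synthetic and the helper names `make_windows` and `knn_predict` are hypothetical; a real project would more likely use scikit-learn and measured data.

```python
# KNN regression for short-term traffic flow (illustrative sketch).
# The flow series is synthetic: a made-up repeating pattern of vehicle counts.
import math


def make_windows(series, window):
    """Turn a series into (previous `window` readings, next reading) pairs."""
    count = len(series) - window
    features = [series[i:i + window] for i in range(count)]
    targets = [series[i + window] for i in range(count)]
    return features, targets


def knn_predict(features, targets, query, k=3):
    """Average the targets of the k historical windows closest to the query."""
    ranked = sorted(
        (math.dist(row, query), target) for row, target in zip(features, targets)
    )
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)


flows = [100.0, 120.0, 150.0, 120.0] * 6   # synthetic readings
X, y = make_windows(flows, window=3)
latest = flows[-3:]                         # the three most recent readings

if __name__ == "__main__":
    print(knn_predict(X, y, latest, k=3))
```

Here the most recent readings match earlier stretches of the series exactly, so the prediction is simply the value that followed them; with noisy real data the neighbours would only be approximately similar and the average would smooth over them.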
Useful links:
- Prediction of Road Traffic Congestion Based on Random Forest
- An Improved K-nearest Neighbor Model for Short-term Traffic Flow Prediction
- Optimized and meta-optimized neural networks for short-term traffic flow prediction: A genetic approach
- A Long Short-Term Memory Recurrent Neural Network Framework for Network Traffic Matrix Prediction
- Road Traffic Prediction Using Artificial Neural Networks
🔴 Hidden Markov models
A Markov chain is a stochastic model that describes a sequence of events (random variables) where the probability of each event depends only on the previous event's state. These random variables can take values from a variety of sets: words, tags, or symbols representing anything, like the weather. In many cases, however, the events we are interested in are hidden: we don’t observe them directly. For example, we don’t normally observe part-of-speech tags in a text. Rather, we see words, and must infer the tags from the word sequence. A hidden Markov model (HMM) allows us to talk about both observed and hidden events.
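One standard way to infer the hidden sequence from the observed one is the Viterbi algorithm, sketched below on a toy weather example: the weather is hidden and only a person's daily activity is observed. All probabilities are illustrative numbers chosen for the example, not estimates from data, and the function name `viterbi` is simply a label for this sketch.

```python
# Viterbi decoding for a toy hidden Markov model (illustrative sketch).
# Hidden states: the weather. Observations: what a person did that day.
# All probabilities below are made-up example values.

STATES = ("Rainy", "Sunny")
START_P = {"Rainy": 0.6, "Sunny": 0.4}
TRANS_P = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
EMIT_P = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state path and its probability."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               the state that path was in at time t - 1)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    probability = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, probability


if __name__ == "__main__":
    path, probability = viterbi(
        ["walk", "shop", "clean"], STATES, START_P, TRANS_P, EMIT_P
    )
    print(path, probability)
```

Part-of-speech tagging follows the same pattern, with tags as the hidden states and words as the observations.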
Applications that can be explored lie in the field of Natural Language Processing and Financial Time Series.
Useful links:
- A Guide to Hidden Markov Model and its Applications in NLP
- Review on Usage of Hidden Markov Model in Natural Language Processing
- Statistical Markovian data modelling for Natural Language Processing
- Prediction of financial time series with hidden markov models
- Stock Market Trend Analysis Using Hidden Markov Models
- Hidden Markov Model for Financial Time Series and Its Application to S&P 500 Index
Generative modelling is an unsupervised learning task in machine learning that involves automatically discovering and learning the regularities or patterns in input data, so that the model can be used to generate new examples that could plausibly have been drawn from the original dataset. Generative adversarial networks (GANs) are a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models: the generator model, which we train to generate new examples, and the discriminator model, which tries to classify examples as either real (from the domain) or fake (generated).
Application areas can lie in game theory.
Useful links: