
This research scheme is run by the Warwick Data Science Society for the period of Term 1, academic year of 22/23


warwickdatasciencesociety/2022-Term1-Research-Scheme

 
 


Do you want to learn and apply ML algorithms to real data sets? Get your hands dirty cleaning datasets and producing insightful graphics!
Warwick and UCL's data science societies are announcing a collaboration for this summer's research scheme!


Please refer to the repository here.

Why take part?

✅ Build valuable research project experience that you can use for future job/university applications
✅ Boost your CV
✅ Learn a new programming language like Python, R or Julia
✅ Learn a new statistical model and machine learning algorithm
✅ Work closely with experienced data scientists, graduates and PhD students

Target

The scheme is open to any university student interested in cleaning datasets, producing graphics, learning about statistical models, and applying ML algorithms to real datasets.

🏳️ There are no requirements regarding your subject of study - data science skills are valuable tools to master in plenty of work areas!

☢️ Check out the level of complexity of the projects! It goes by traffic light colors 🔴 🟡 🟢

The scheme is run by students for students. It is intended as an informal way to gain the kind of research experience you may encounter in your dissertation or in other work, and it can turn out to be useful for your MSc or job applications. We may plan a social event 💃 at Warwick or UCL (or both!) for the scheme's participants to chat about the projects, results and challenges over a glass of cold beer 🍺

What to expect

As one of WDSS & UCLDSS's summer researchers, you will be immersed in either group or independent work on one of the projects. We offer access to resources and support from experienced data scientists and subject ambassadors with specific domain expertise.

  • run virtually 💻
  • the projects are done in small teams, unless you wish otherwise
  • weekly meetings with a supervisor 🦸: a chance to seek guidance and suggestions, catch up with things, show your work and share your thoughts
  • produce a final report (with code to reproduce results) which will be published on Warwick and UCL blogs
  • there will be the possibility to present your work to YRM 🧑‍🏫, an informal seminar that gathers young researchers in statistics (no professors involved). This would be entirely optional, but a good opportunity to develop presenting skills in a friendly environment and showcase your work.

Useful skills

💻 Some knowledge of either R or Python.
📚 Basic knowledge of linear regression, algebra, hypothesis testing, exponential distributions.

Types of projects

Here you can find a list of the proposed projects. Feel free to suggest your own research question or get in touch to formulate one that suits your preferences!

🟢 Text analysis of Tweets

Text analysis is the process of using computer systems to read and understand text. Text can be in any shape and form - emails, tweets, social media comments, marketing copy, customer support tickets, survey responses.
Many tasks can be performed on text, such as gaining insights into emotions, moods or opinions using sentiment analysis. For example, a favorable review often contains words like good, fast, and great, whereas a negative review might contain words like unhappy, slow, and bad.
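As a rough illustration of the idea, here is a minimal lexicon-based sentiment sketch. It is only a toy under stated assumptions: the two word sets reuse the example words from the paragraph, the function names `sentiment_score` and `label` are made up for this example, and the reviews are invented. A real project would use a much larger lexicon or a trained model.

```python
import re

# Tiny illustrative lexicons built from the example words in the text.
# These are assumptions for the sketch, not a real sentiment lexicon.
POSITIVE = {"good", "fast", "great"}
NEGATIVE = {"unhappy", "slow", "bad"}

def sentiment_score(text):
    """Return (number of positive words) - (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def label(text):
    """Map the score to a coarse label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Invented example reviews
reviews = [
    "Great product and fast delivery, really good!",
    "Slow shipping and bad support. I am unhappy.",
    "It arrived on Tuesday.",
]
for review in reviews:
    print(label(review), "|", review)
```

Counting words like this ignores negation ("not good") and sarcasm, which is one reason sentiment analysis of tweets makes for an interesting project.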


Useful links:

🟢 Top 10 Python libraries

Python is one of the most widely used programming languages today. Because so much of its functionality comes from libraries, it is vital to gain familiarity with the most popular and requested ones. These include TensorFlow, NumPy, SciPy, pandas, Matplotlib, Keras, scikit-learn, PyTorch, Scrapy and Beautiful Soup.


Useful links:

🟢 Artificial Neural Networks

Artificial neural networks are forecasting methods that are based on simple mathematical models of the brain. They allow complex nonlinear relationships between the response variable and its predictors.

Application areas include predicting solar radiation. Accurate prediction of solar radiation is crucial in both the solar industry and climate research. For example, forecasting the output power of solar systems is required for the reliable operation of the power grid and for the optimal management of the energy fluxes in the solar energy system.
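To make "complex nonlinear relationships" a little more concrete, here is a minimal sketch of a network with one hidden layer. The weights are set by hand (an assumption made purely for illustration; in practice they are learned from data), and the function name `tiny_network` is hypothetical. The network reproduces XOR, a relationship that no single linear model can represent.

```python
def relu(z):
    """Rectified linear unit: a common nonlinear activation."""
    return max(0.0, z)

def tiny_network(x1, x2):
    # Hidden layer: two neurons, each a weighted sum followed by ReLU.
    # Weights and biases are hand-picked for this illustration.
    h1 = relu(1.0 * x1 + 1.0 * x2 + 0.0)
    h2 = relu(1.0 * x1 + 1.0 * x2 - 1.0)
    # Output layer: a linear combination of the hidden activations.
    return 1.0 * h1 - 2.0 * h2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", tiny_network(x1, x2))
```

Without the `relu` step the whole network would collapse to a single linear function of the inputs, so the nonlinearity in the hidden layer is what gives the model its flexibility.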

Useful links:

🟢 Spatial Statistics

The term spatial statistics refers to the application of statistical concepts and methods to data that have a spatial location attached to them, and in which this locational element is used as an important and necessary part of the analysis.

Its applications include exploring geographical determinants relevant for evaluating health intervention programmes or disaster management policies (e.g., spatial inequalities within neighbourhoods, such as poor road conditions, that have an impact on child mortality or higher education outcomes). Another area of application is exploring spatio-temporal dependencies to model climate data. Models that can be used include spatial autoregressive models, GLMs, and Bayesian hierarchical models.


Useful links:

🟡 ML with images

There exist numerous ML/DL algorithms that deal with images as input variables and perform different tasks, from reconstruction to detection, classification, etc.

Application areas can lie in the use of medical or satellite images.

Useful links:

🟡 Bayesian methods in sport statistics

Bayesian methods are becoming increasingly popular in sports analytics. Identified advantages of the Bayesian approach include the ability to model complex problems, obtain probabilistic estimates and predictions that account for uncertainty, combine information sources and update learning as new data become available. The volume and variety of data produced in sports activities over recent years and the availability of software packages for Bayesian computation have contributed significantly to this growth.
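The idea of updating as new data become available can be sketched with the simplest conjugate model, a Beta prior on a success probability with binomial data. Everything specific here is an assumption for illustration: the penalty-kick framing, the uniform Beta(1, 1) prior, the made-up counts, and the helper names `update_beta` and `beta_mean`.

```python
def update_beta(alpha, beta, successes, failures):
    """Conjugate update: Beta(alpha, beta) prior + binomial data -> Beta posterior."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Posterior mean of the success probability."""
    return alpha / (alpha + beta)

# Uniform prior on a player's penalty conversion rate (an assumption).
alpha, beta = 1.0, 1.0

# Hypothetical (scored, missed) counts arriving in three batches.
batches = [(3, 2), (4, 1), (1, 4)]
for scored, missed in batches:
    alpha, beta = update_beta(alpha, beta, scored, missed)
    print(f"after batch ({scored}, {missed}): "
          f"Beta({alpha:.0f}, {beta:.0f}), mean = {beta_mean(alpha, beta):.3f}")
```

The posterior after each batch becomes the prior for the next one, so the estimate and its uncertainty are refined as the season goes on. Realistic sports models are usually hierarchical and need software for Bayesian computation rather than a closed-form update.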

Useful links:

🟡 Reinforcement Learning

Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximise cumulative reward.

Areas of application can lie in algorithmic trading strategies.
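A minimal sketch of the agent, environment and reward loop, using tabular Q-learning (one common RL algorithm, chosen here for illustration) on a made-up five-state corridor. The environment, the hyperparameters and all names (`step`, `choose_action`, `train`) are assumptions for this example and have nothing to do with trading.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
ACTIONS = (-1, +1)    # index 0 = move left, index 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # learning rate, discount, exploration

def step(state, action):
    """Deterministic toy environment: reward 1 only on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def choose_action(q, state, rng):
    """Epsilon-greedy choice with random tie-breaking."""
    if rng.random() < EPSILON:
        return rng.randrange(len(ACTIONS))
    best = max(q[state])
    return rng.choice([a for a, v in enumerate(q[state]) if v == best])

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action index]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            a = choose_action(q, state, rng)
            next_state, reward = step(state, ACTIONS[a])
            # Q-learning update: move the estimate towards
            # reward + discounted value of the best next action.
            target = reward + GAMMA * max(q[next_state])
            q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
    return q

q_table = train()
policy = ["left" if q[0] > q[1] else "right" for q in q_table[:-1]]
print(policy)
```

The agent is never told that moving right is correct. It only receives a reward at the goal, and the discount factor propagates that signal back to earlier states.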


Useful links:

🟡 ML for traffic prediction

Traffic prediction is the task of forecasting real-time traffic information based on floating car data and historical traffic data, such as traffic flow, average traffic speed and traffic incidents. Its uses include managing vehicle movement, reducing congestion, and generating the optimal route. Existing ML approaches include the random forest algorithm, which builds multiple decision trees and combines their outputs to obtain accurate predictions, and the k-nearest neighbors (KNN) algorithm, which relies on the principle of feature similarity to predict future values.
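The feature-similarity principle behind KNN can be sketched in a few lines. This is a bare-bones KNN regressor written from scratch; the function name `knn_predict`, the two features and all numbers are hypothetical, and a real project would scale the features and use a library implementation.

```python
import math

def knn_predict(train_x, train_y, query, k=3):
    """Average the targets of the k training points closest to `query`."""
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical data: (hour of day, vehicles per minute) -> average speed in km/h
features = [(7, 40), (8, 55), (9, 50), (13, 25), (14, 20), (23, 5)]
speeds = [35.0, 20.0, 25.0, 50.0, 55.0, 70.0]

# Predict the speed for a new situation at 8am with 50 vehicles per minute.
prediction = knn_predict(features, speeds, query=(8, 50), k=3)
print(round(prediction, 2))
```

The prediction is simply the mean speed of the three most similar past situations, which here are the three morning rush-hour observations.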


Useful links:

🔴 Hidden Markov models

A Markov chain is a stochastic model that describes a sequence of events (random variables) where the probability of each event depends only on the state of the previous event. These random variables can take values from a variety of sets: words, tags, or symbols representing anything, like the weather. In many cases, however, the events we are interested in are hidden: we don’t observe them directly. For example, we don’t normally observe part-of-speech tags in a text. Rather, we see words, and must infer the tags from the word sequence. A hidden Markov model (HMM) allows us to talk about both observed and hidden events.

Applications that can be explored lie in the field of Natural Language Processing and Financial Time Series.
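Inferring hidden states from observations can be sketched with the Viterbi algorithm, which finds the most likely hidden sequence for a given HMM. The weather states, the observed activities and every probability below are made-up illustrative values, not estimates from data.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, most likely hidden state sequence)."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               previous state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            row[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(row)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for row in reversed(best[1:]):
        path.append(row[path[-1]][1])
    path.reverse()
    return prob, path

# Hidden states (weather) and observed events (activities); toy numbers.
states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emit_p = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

prob, path = viterbi(["walk", "shop", "clean"], states, start_p, trans_p, emit_p)
print(path, prob)
```

Part-of-speech tagging has the same structure: the words play the role of the activities and the tags play the role of the weather. For long sequences, log-probabilities are normally used to avoid numerical underflow.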


Useful links:

🔴 Generative Adversarial Networks (GAN)

Generative modelling is an unsupervised learning task in machine learning that involves automatically discovering and learning the regularities or patterns in input data in such a way that the model can be used to generate or output new examples that plausibly could have been drawn from the original dataset. GANs are a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models: the generator model that we train to generate new examples, and the discriminator model that tries to classify examples as either real (from the domain) or fake (generated).

Application areas can lie in game theory.


Useful links:
