Insight_coding_challenge

The solution to this coding challenge is in src/rolling_median.py

The solution is written in Python 3, so Python 3 must be installed in the environment

Description of what src/rolling_median.py does:

Note that steps 2 through 6 are performed for each transaction

Note that EDGE_LIST is a list of lists with the format: [ [timestamp1,['actor1','target1']], [timestamp2,['actor2','target2']], ...etc]
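To make the layout concrete, here is a small illustration of that structure. The node names, the use of integer epoch seconds for timestamps, and the helper name `split_edge_list` are assumptions made for this example, not taken from src/rolling_median.py.

```python
# Hypothetical sample data in the documented layout:
# [ [timestamp, [actor, target]], ... ]
# Timestamps are shown as epoch seconds, which is an assumption.
edge_list = [
    [1459987200, ['alice', 'bob']],
    [1459987230, ['bob', 'carol']],
]


def split_edge_list(edge_list):
    """Separate the timestamps from the node pairs (illustrative helper)."""
    timestamps = [entry[0] for entry in edge_list]
    edges = [tuple(entry[1]) for entry in edge_list]
    return timestamps, edges


timestamps, edges = split_edge_list(edge_list)
newest = max(timestamps)  # the most recent timestamp in the list
```

Because each entry keeps its own timestamp next to the node pair, the list can be filtered by age or searched by edge without any extra index, at the cost of a linear scan.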

1. Get the transactions. Then, for each transaction, do the following:

2. Check the validity of the entry. If the actor or target is missing, or if actor == target, ignore and drop it.

3. Check and update the timestamp:
   - If the timestamp is more than 60 seconds older than newest_timestamp, skip straight to calc_median_degree() (step 6). This follows the FAQ, which says a median value must still be output for every transaction, even one that falls outside the 60-second window.
   - If the timestamp is newer than the newest one seen so far, update the global newest_timestamp value.

4. Delete edges that are older than 60 seconds. This is O(n) because the edge list is not sorted by timestamp (it is not sorted at all). Keeping the edges in a B-tree keyed by timestamp would bring each insertion and deletion down to O(log n), but the tree would need rebalancing whenever a large share of its entries is deleted.

5. Insert the new edge into edge_list, unless it already exists:
   - Sort each edge (target, actor) alphabetically, since this is an undirected graph (e.g. {target, actor} is equal to {actor, target}).
   - Check whether the edge already exists. If it does, update the timestamp of that edge. There is no need to check for the reverse order of the edge, because every stored edge is already sorted. This is O(n) because the edge list is not sorted by edge (where an edge is a pair of node names).

6. Call calc_median_degree():
   - Concatenate the two columns of nodes in edge_list to get a list of all nodes, with duplicates.
   - Use Python's Counter to count the occurrences of each node, which gives the degree of each node. This is O(n): the list is unsorted, so it takes one linear pass. The resulting list of counts is also unsorted.
   - Use Python's statistics package to get the median. This is O(n log n) because statistics.median sorts its input. It might be possible in O(n) with the median-of-medians algorithm, but there is some discussion in the Python community about whether that is actually faster, so this solution sticks with the statistics library. See https://bugs.python.org/issue21592 and http://stackoverflow.com/questions/10662013/finding-the-median-of-an-unsorted-array
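The per-transaction steps described above can be sketched as follows. This is a minimal illustration, not the code in src/rolling_median.py: only the name `calc_median_degree` comes from this README, while `is_valid`, `prune`, `upsert`, `process` and `WINDOW` are hypothetical. The sketch passes state explicitly instead of using a global, assumes numeric timestamps in seconds, and treats an edge as expired only when it is strictly more than 60 seconds older than the newest timestamp (the exact boundary is an assumption).

```python
from collections import Counter
import statistics

WINDOW = 60  # window length in seconds


def is_valid(actor, target):
    """Step 2: reject entries with a missing field or a self-payment."""
    return bool(actor) and bool(target) and actor != target


def prune(edge_list, newest):
    """Step 4: keep only edges inside the window. O(n) linear scan."""
    return [entry for entry in edge_list if newest - entry[0] <= WINDOW]


def upsert(edge_list, timestamp, actor, target):
    """Step 5: insert the edge, or refresh its timestamp if it exists."""
    edge = sorted([actor, target])  # undirected: store in canonical order
    for entry in edge_list:  # O(n) search, the list is unsorted
        if entry[1] == edge:
            # Assumption: never move a timestamp backwards when an
            # out-of-order transaction repeats an existing edge.
            entry[0] = max(entry[0], timestamp)
            return
    edge_list.append([timestamp, edge])


def calc_median_degree(edge_list):
    """Step 6: median of node degrees; 0 for an empty graph (assumption)."""
    nodes = [node for _, edge in edge_list for node in edge]
    degrees = Counter(nodes)
    return statistics.median(degrees.values()) if degrees else 0


def process(edge_list, newest, timestamp, actor, target):
    """Run steps 2-6 for one transaction.

    Returns (edge_list, newest, median); median is None when the
    entry is invalid and therefore dropped.
    """
    if not is_valid(actor, target):
        return edge_list, newest, None
    if newest is not None and newest - timestamp > WINDOW:
        # Step 3: too old to enter the graph, but still report a median.
        return edge_list, newest, calc_median_degree(edge_list)
    if newest is None or timestamp > newest:
        newest = timestamp
    edge_list = prune(edge_list, newest)
    upsert(edge_list, timestamp, actor, target)
    return edge_list, newest, calc_median_degree(edge_list)
```

A caller would start with `edge_list, newest = [], None` and feed each transaction through `process`, writing out the returned median whenever it is not `None`.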

src/rolling_median.py already imports all the packages it needs.

The following packages are used/imported:

- `import time` - needed to deal with timestamps
- `import sys` - for reading the arguments of the run.sh command
- `import json` - for processing JSON
- `import os` - for checking if output.txt already exists, and deleting it if it does
- `from collections import Counter` - used for counting the number of occurrences of each node in the edge list to get the degree of each node
- `import statistics` - used for finding the median
- `import pdb` - Python debugger, used for debugging

Note that this repo started off as a clone of https://github.com/InsightDataScience/coding-challenge

To run the tests, call "./run_tests.sh" from within the insight_testsuite directory
