
Method: Wrangle Functions Intro

Matthew R. DeVerna edited this page Aug 23, 2021 · 2 revisions

osometweet.wrangle includes a handful of low-level data processing functions that we think could be useful when wrangling your Twitter data into something easier to analyze. The idea behind these functions was to create methods that you can easily adapt to your data processing pipeline, as opposed to creating our own that you must adopt.

Below we provide simple examples of how each function works.

Import

We can import these functions via...

from osometweet.wrangle import get_dict_paths, get_dict_val, flatten_dict

... where each function to the right of the import is a function you plan to utilize.

As a result, we can now access each function directly. That is, if we want to use the flatten_dict function, we simply call flatten_dict(some_dictionary_object) and it will work.


flatten_dict

This function takes a nested dictionary and "flattens" it so that the keys of each nested dictionary are concatenated into a single string, and the value is the value at the end of that key path. This function can help you simplify the complexity of a nested dictionary (like Twitter's data objects) so it is easier to manage.

Let's see what this means.

# Create dictionary
dictionary = {
    "a" : 1,
    "b" : {
        "c" : 2,
        "d" : 5
    },
    "e" : {
        "f" : 4,
        "g" : 3
    },
    "h" : 3
}

1. Using function as is

flatten_dict(dictionary)

# Returns:
{'a': 1, 'b.c': 2, 'b.d': 5, 'e.f': 4, 'e.g': 3, 'h': 3}

2. Changing parent_key

This function has a parameter called parent_key, which it uses internally to build the key paths. We typically recommend that you do not touch it. However, here is what tinkering with it will do, should you find some use for it. 😄

# Parent key will add `parent_key` as a prefix to all keys
flatten_dict(dictionary, parent_key = "NEW")

# Returns
{'NEW.a': 1, 'NEW.b.c': 2, 'NEW.b.d': 5, 'NEW.e.f': 4, 'NEW.e.g': 3, 'NEW.h': 3}

3. Changing sep

Another parameter, sep, controls the string that separates each level of the concatenated key path. As you saw above, the default is a period (i.e., '.'), but it can be whatever you prefer.

# This string is what will separate key path strings
flatten_dict(dictionary, sep = "/")

# Returns
{'a': 1, 'b/c': 2, 'b/d': 5, 'e/f': 4, 'e/g': 3, 'h': 3}
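
To make the roles of parent_key and sep more concrete, here is a minimal sketch of how a flattening function of this kind could be written. The name flatten_sketch is hypothetical and this is not osometweet's actual source; it only assumes the behavior described above (nested keys joined by sep, with parent_key acting as a prefix).

```python
def flatten_sketch(d, parent_key="", sep="."):
    """Hypothetical stand-in for flatten_dict (NOT the library's source)."""
    items = {}
    for key, value in d.items():
        # Join the prefix and the key, skipping the separator when there is no prefix
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse, passing the path built so far as the new prefix
            items.update(flatten_sketch(value, new_key, sep))
        else:
            # Anything that is not a dictionary (including a list) is stored as-is
            items[new_key] = value
    return items


dictionary = {"a": 1, "b": {"c": 2, "d": 5}, "e": {"f": 4, "g": 3}, "h": 3}

flat_default = flatten_sketch(dictionary)
flat_custom = flatten_sketch(dictionary, parent_key="NEW", sep="/")
print(flat_default)
print(flat_custom)
```

Seen this way, parent_key is simply the prefix that each recursive call hands to the next level, which is why it rarely needs to be set by hand.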

flatten_dict and Twitter data

It is important to note that the flatten_dict function handles all nested dictionaries but will stop when it reaches something other than a dictionary. This means that data objects which contain a list as the value (e.g., urls and context_annotations) will need further processing.

To understand what this means in more detail, I've created a walk-through of one way you might process a couple of tweets using this function while keeping the above in mind.
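
Separately from that walk-through, the sketch below illustrates one possible way to handle a list that is left over after flattening. Everything in it is an assumption made for illustration: flatten_sketch is a hypothetical stand-in for flatten_dict (so the example runs without the library), and the tweet is made-up data shaped loosely like a Twitter data object.

```python
def flatten_sketch(d, parent_key="", sep="."):
    """Hypothetical stand-in for flatten_dict (NOT the library's source)."""
    items = {}
    for key, value in d.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            items.update(flatten_sketch(value, new_key, sep))
        else:
            items[new_key] = value
    return items


# Made-up tweet-like object; the field names are illustrative only
tweet = {
    "id": "1",
    "text": "hello",
    "entities": {
        "urls": [
            {"expanded_url": "https://example.com/a"},
            {"expanded_url": "https://example.com/b"},
        ]
    },
}

flat = flatten_sketch(tweet)
# Flattening stopped at the list, so it is still stored whole under one key
assert isinstance(flat["entities.urls"], list)

# One option: emit one row per list element, repeating the tweet-level fields
rows = []
for url in flat.get("entities.urls", []):
    row = {k: v for k, v in flat.items() if k != "entities.urls"}
    # Flatten each element, reusing the list's key as the prefix
    row.update(flatten_sketch(url, parent_key="entities.urls"))
    rows.append(row)

for row in rows:
    print(row)
```

Whether one row per list element is the right shape depends on your analysis; keeping the list in a single column is equally reasonable.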


get_dict_paths

This function returns a generator that iterates over every full key path within a dictionary. Twitter often returns only the data that is present for a given data object: certain fields/expansions (see info, our methods for more details) appear in a data object only if there is something to return for them. This function can therefore help you understand what your data object actually contains.

Here is a simple example...

# Create dictionary
dictionary = {
    "a" : 1,
    "b" : {
        "c" : 2,
        "d" : 5
    },
    "e" : {
        "f" : 4,
        "g" : 3
    },
    "h" : 3
}

# Call get_dict_paths
print(list(get_dict_paths(dictionary)))

# Returns
[['a'], ['b', 'c'], ['b', 'd'], ['e', 'f'], ['e', 'g'], ['h']]
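
As an illustration of using key paths to see what a data object contains, the sketch below compares the paths present in two objects. The generator dict_paths_sketch is a hypothetical stand-in written for this example, not the library's implementation, and both tweet-like objects are made up.

```python
def dict_paths_sketch(d, path=()):
    """Hypothetical stand-in for get_dict_paths (NOT the library's source)."""
    for key, value in d.items():
        current = list(path) + [key]
        if isinstance(value, dict):
            # Descend into nested dictionaries, carrying the path so far
            yield from dict_paths_sketch(value, current)
        else:
            # A non-dictionary value marks the end of a full key path
            yield current


# Made-up objects: the second one has no "geo" information at all
tweet_a = {"id": "1", "geo": {"place_id": "abc"}}
tweet_b = {"id": "2"}

# Lists are not hashable, so convert each path to a tuple before building sets
paths_a = {tuple(p) for p in dict_paths_sketch(tweet_a)}
paths_b = {tuple(p) for p in dict_paths_sketch(tweet_b)}

only_in_a = paths_a - paths_b
print(only_in_a)
```

A set difference like this is a quick way to spot fields that appear for some objects but not for others.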

get_dict_val

This function returns a dictionary value at the end of a key path - provided as a list, like those returned by get_dict_paths.

Here is what this function looks like in practice.

1. Finding an existing value

# Create dictionary
dictionary = {
    "a" : 1,
    "b" : {
        "c" : 2,
        "d" : 5
    },
    "e" : {
        "f" : 4,
        "g" : 3
    },
    "h" : 3
}

# Create key_list
key_list = ['b', 'c']

# Execute function
get_dict_val(dictionary, key_list)

# Returns
2

2. When input key_path doesn't exist

It is important to know that this function does not raise an error if you ask it for a value at the end of a key path that doesn't exist. Instead, it returns None.

# Create key_list
key_list = ['b', 'k']

# Execute function
value = get_dict_val(dictionary, key_list)

# Returns NoneType because the provided path doesn't exist
type(value)
NoneType
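
Returning None for a missing path is convenient when pulling the same field out of many objects, some of which lack it. The sketch below shows the idea with dict_val_sketch, a hypothetical stand-in for get_dict_val (not the library's source), and made-up records.

```python
def dict_val_sketch(d, key_list):
    """Hypothetical stand-in for get_dict_val (NOT the library's source)."""
    current = d
    for key in key_list:
        # Stop with None as soon as the path cannot be followed any further
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Made-up records: the second one has no "public_metrics" field
records = [
    {"id": "1", "public_metrics": {"like_count": 3}},
    {"id": "2"},
]

# Pull the same key path from every record without any try/except blocks
likes = [dict_val_sketch(r, ["public_metrics", "like_count"]) for r in records]
print(likes)
```

One consequence worth keeping in mind: because a missing path yields None, a value that is genuinely None cannot be told apart from an absent one by the return value alone.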