Method: Wrangle Functions Intro
osometweet.wrangle includes a handful of low-level data processing functions that we think could be useful when wrangling your Twitter data into something easier to analyze. The idea behind these functions is to give you building blocks that you can easily adapt to your own data processing pipeline, rather than a pipeline of ours that you must adopt.
Below we provide simple examples of how each function works.
We can import these functions via...
from osometweet.wrangle import get_dict_paths, get_dict_val, flatten_dict
... where each name to the right of import is a function you plan to use.
We can then access each function directly. That is, if we want to use the flatten_dict function, we simply call flatten_dict(some_dictionary_object) and it will work.
flatten_dict takes a nested dictionary and "flattens" it: the keys along each nested path are concatenated into a single string, and the value is whatever sits at the end of that key path. This can help you reduce the complexity of a nested dictionary (like Twitter's data objects) so it is easier to manage.
Let's see what this means.
# Create dictionary
dictionary = {
    "a" : 1,
    "b" : {
        "c" : 2,
        "d" : 5
    },
    "e" : {
        "f" : 4,
        "g" : 3
    },
    "h" : 3
}
flatten_dict(dictionary)
# Returns:
{'a': 1, 'b.c': 2, 'b.d': 5, 'e.f': 4, 'e.g': 3, 'h': 3}
This function has a parameter called parent_key, which it relies on to do its work. We typically recommend that you leave it alone. However, here is what tinkering with it will do, should you find some use for it. 😄
# Parent key will add `parent_key` as a prefix to all keys
flatten_dict(dictionary, parent_key = "NEW")
# Returns
{'NEW.a': 1, 'NEW.b.c': 2, 'NEW.b.d': 5, 'NEW.e.f': 4, 'NEW.e.g': 3, 'NEW.h': 3}
Another parameter, sep, allows you to control the string that separates each level of the concatenated key path. As you saw above, the default is a period (i.e., '.'), but it can be whatever you prefer.
# This string is what will separate key path strings
flatten_dict(dictionary, sep = "/")
# Returns
{'a': 1, 'b/c': 2, 'b/d': 5, 'e/f': 4, 'e/g': 3, 'h': 3}
It is important to note that the flatten_dict function handles all nested dictionaries but stops when it reaches something other than a dictionary. This means that data objects whose value is a list (e.g. urls and context_annotations) will need further processing.
To understand what this means in more detail, I've created a walk-through of one way you might process a couple of tweets using this function while keeping the above in mind.
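As a rough illustration of the point about lists, the sketch below uses a small stand-in, flatten_sketch, written only for this example. It is assumed to mimic the behaviour described above and is not the osometweet implementation. The tweet-like dictionary and its values are made up.

```python
def flatten_sketch(d, parent_key="", sep="."):
    """Hypothetical stand-in that mimics the flattening described above."""
    flat = {}
    for key, value in d.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key path along
            flat.update(flatten_sketch(value, new_key, sep))
        else:
            # Lists (and every other non-dict value) are kept as-is
            flat[new_key] = value
    return flat


# Made-up, tweet-like object with a list of dictionaries under "urls"
tweet = {
    "id": "1",
    "entities": {
        "urls": [
            {"expanded_url": "https://example.com/a"},
            {"expanded_url": "https://example.com/b"},
        ]
    },
}

flat = flatten_sketch(tweet)

# The list survives flattening untouched
urls = flat["entities.urls"]

# One possible follow-up step: pull a single field out of each list item
expanded = [item.get("expanded_url") for item in urls]
print(expanded)
```

How you handle each list (one row per item, joined strings, counts, and so on) depends on your analysis, which is why this step is left to you.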
get_dict_paths returns a generator that iterates over all full key paths within the dictionary you pass to it. Because Twitter often returns only the data that is present for a specific data object, this function can help you understand what your data object actually contains. For example, certain fields/expansions (see info, our methods for more details) will only be present within a data object if there is something to return for that field/expansion.
Here is a simple example...
# Create dictionary
dictionary = {
    "a" : 1,
    "b" : {
        "c" : 2,
        "d" : 5
    },
    "e" : {
        "f" : 4,
        "g" : 3
    },
    "h" : 3
}
# Call get_dict_paths
print(list(get_dict_paths(dictionary)))
# Returns
[['a'], ['b', 'c'], ['b', 'd'], ['e', 'f'], ['e', 'g'], ['h']]
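To make the "what does my data object actually contain" use case more concrete, here is a minimal sketch that compares the key paths of two objects. The paths_sketch generator is a hypothetical stand-in assumed to behave like the example above (it is not the library's code), and the two user objects are invented for illustration.

```python
def paths_sketch(d, prefix=None):
    """Hypothetical stand-in: yield each full key path as a list."""
    prefix = prefix or []
    for key, value in d.items():
        path = prefix + [key]
        if isinstance(value, dict):
            # Keep descending until a non-dict value is reached
            yield from paths_sketch(value, path)
        else:
            yield path


# Invented user objects: only the first one carries a "location" field
user_a = {
    "id": "1",
    "public_metrics": {"followers_count": 10},
    "location": "Earth",
}
user_b = {
    "id": "2",
    "public_metrics": {"followers_count": 3},
}

# Lists are not hashable, so convert each path to a tuple before using sets
paths_a = {tuple(path) for path in paths_sketch(user_a)}
paths_b = {tuple(path) for path in paths_sketch(user_b)}

# Key paths that exist in user_a but are missing from user_b
only_in_a = sorted(paths_a - paths_b)
print(only_in_a)
```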
get_dict_val returns the dictionary value at the end of a key path. The key path is provided as a list, like those returned by get_dict_paths.
Here is what this function looks like in practice.
# Create dictionary
dictionary = {
    "a" : 1,
    "b" : {
        "c" : 2,
        "d" : 5
    },
    "e" : {
        "f" : 4,
        "g" : 3
    },
    "h" : 3
}
# Create key_list
key_list = ['b', 'c']
# Execute function
get_dict_val(dictionary, key_list)
# Returns
2
It is important to know that this function does not raise an error if you ask it for the value at the end of a key path that doesn't exist. Instead, it returns None.
# Create key_list
key_list = ['b', 'k']
# Execute function
value = get_dict_val(dictionary, key_list)
# Returns NoneType because the provided path doesn't exist
type(value)
NoneType
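One place where returning None for a missing path may come in handy is when building rows with a fixed set of columns from objects that do not all share the same fields. The sketch below assumes that behaviour and uses a hypothetical stand-in, val_sketch, rather than the library function. The user objects and column choices are made up.

```python
def val_sketch(d, key_list):
    """Hypothetical stand-in: follow key_list, return None if the path breaks."""
    current = d
    for key in key_list:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Invented user objects: the second one has no "location" field
users = [
    {"id": "1", "public_metrics": {"followers_count": 10}, "location": "Earth"},
    {"id": "2", "public_metrics": {"followers_count": 3}},
]

# The key paths we want as columns, in a fixed order
columns = [["id"], ["public_metrics", "followers_count"], ["location"]]

# Every row has the same length; missing values simply become None
rows = [[val_sketch(user, path) for path in columns] for user in users]
print(rows)
```

Note that with this approach a path that is missing and a path whose stored value is None look the same, so check for the key explicitly if that difference matters to you.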