
Commit

beta release update
alvarob96 committed Apr 4, 2019
1 parent 8caf508 commit 95d3cb0
Showing 7 changed files with 609 additions and 180 deletions.
588 changes: 475 additions & 113 deletions .idea/workspace.xml

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion .travis.yml
@@ -11,7 +11,7 @@ install:
  - pip install pandas==0.23.4
  - pip install requests==2.21.0
  - pip install beautifulsoup4==4.7.1
-  - pip install investpy==0.7
+  - pip install investpy==0.8
  - pip install pytest==4.1.1
  - pip install lxml==4.3.2

101 changes: 52 additions & 49 deletions README.md
@@ -16,7 +16,7 @@ To conclude this section, I am in the need to specify that this is not the final

In order to get this package working you will need to install [**investpy**](https://pypi.org/project/investpy/) from PyPI via terminal, typing:

-``pip install investpy``
+``pip install investpy==0.8``

All the dependencies are already listed on the setup file of the package, but to sum them up, you will need the following requirements:

@@ -28,74 +28,77 @@ All the dependencies are already listed on the setup file of the package, but to

## Use

-Currently you just have two possible options to retrieve data with this scraper:
+As this package is in a beta version, any use case or feature that is not yet implemented can be developed for a future release, so do not hesitate to ask for it. Currently, using investpy you can:

-* **Retrieve the recent data of an equity/fund**: it retrieves the historical data of an equity/fund from the last month. The function also checks if the introduced equity/fund name is correct and then retrieves the data.
+* **Retrieve the Recent Data of an Equity/Fund/ETF**: it retrieves the historical data of an equity/fund/etf from the last month. The function also checks that the introduced equity/fund/etf name is correct before retrieving the data.
 The function has some optional parameters like:
-    * *as_json* by default is **False** but if **True** the output of the function is a JSON object, not a pandas.DataFrame.
-    * *order* by default is **'ascending'** ordering the historical data in the pandas.DataFrame from the older to the newest, **'descending'** should be used for the contrary testing.
+    * *as_json*, by default **False**; if **True** the output of the function is a JSON object instead of a pandas.DataFrame.
+    * *order*, by default **'ascending'**, ordering the historical data in the pandas.DataFrame from oldest to newest; **'descending'** orders it the other way round.

```python
import investpy

-equities_df_ = investpy.get_recent_data('bbva', as_json=False, order='ascending')
-funds_df_ = investpy.get_fund_recent_data('bbva multiactivo conservador pp', as_json=False, order='ascending')
+equities_df = investpy.get_recent_data(equity='bbva', as_json=False, order='ascending')
+funds_df = investpy.get_fund_recent_data(fund='bbva multiactivo conservador pp', as_json=False, order='ascending')
+etfs_df = investpy.get_etf_recent_data(etf='bbva-accion-dj-eurostoxx-50', as_json=False, order='ascending')
```

-* **Retrieve the historical data of an equity/fund from a specific range of time**: it retrieves the historical data from an equity/fund from a range of time between the start and the end date, specified in dd/mm/YY format. This function also checks is the introduced equity/fund name is correct and then retrieves the data.
+* **Retrieve the Historical Data of an Equity/Fund/ETF over a Specific Range of Time**: it retrieves the historical data of an equity/fund/etf between a start and an end date, specified in dd/mm/yyyy format. This function also checks that the introduced equity/fund/etf name is correct before retrieving the data.
 The function has some optional parameters like:
-    * *as_json* by default is **False** but if **True** the output of the function is a JSON object, not a pandas.DataFrame.
-    * *order* by default is **'ascending'** ordering the historical data in the pandas.DataFrame from the older to the newest, **'descending'** should be used for the contrary testing.
+    * *as_json*, by default **False**; if **True** the output of the function is a JSON object instead of a pandas.DataFrame.
+    * *order*, by default **'ascending'**, ordering the historical data in the pandas.DataFrame from oldest to newest; **'descending'** orders it the other way round.

```python
import investpy

-equities_df_ = investpy.get_historical_data('bbva', '10/10/2018', '10/12/2018', as_json=False, order='ascending')
-funds_df_ = investpy.get_fund_historical_data('bbva multiactivo conservador pp', '10/10/2018', '10/12/2018', as_json=False, order='ascending')
+equities_df = investpy.get_historical_data(equity='bbva', start='10/10/2018', end='10/12/2018', as_json=False, order='ascending')
+funds_df = investpy.get_fund_historical_data(fund='bbva multiactivo conservador pp', start='10/10/2018', end='10/12/2018', as_json=False, order='ascending')
+etfs_df = investpy.get_etf_historical_data(etf='bbva-accion-dj-eurostoxx-50', start='10/10/2018', end='10/12/2018', as_json=False, order='ascending')
```

+* **Retrieve the Company Profile of an Equity**: it retrieves the company profile of an equity in either Spanish or English, so a valid equity name and a valid source need to be specified.
+The language of the Company Profile depends on the specified value of the following optional parameter:
+    * **source**, either Investing for the English profile or Bolsa de Madrid for the Spanish one; the default value is Investing, so by default the retrieved Company Profile is in English.

-You can check all the available equities/funds you can retrieve data from in Investing:
-* Equities from the **Spanish Stock Market** -> https://es.investing.com/equities/spain
-* Funds from the **Spanish Stock Market** -> https://es.investing.com/funds/spain-funds
-
-(**NOTE**: you will need an active Internet connection in order to get the scraper working.)

-## Performance Analysis
-
-In this section I am going to explain the case study when developing the package and all the possible options when scraping in order to let you know which is the most efficient way to make a historical data scraper as far as I know based on my research over the past weeks.
-
-Lets start with the first step before scraping a web, in this case [investing](https://es.investing.com/), the process of either downloading the web or sending a post request to a web. As we all know, there a two main tools used to get the HTML code from a website:
-* [urllib3](https://pypi.org/project/urllib3/): urllib3 is a powerful, sanity-friendly HTTP client for Python. Much of the Python ecosystem already uses urllib3 and you should too. urllib3 brings many critical features that are missing from the Python standard libraries.
-* [requests](https://pypi.org/project/requests/): Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor. There's no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, thanks to urllib3.
-
-This unit tests are made with an stable Internet connection and done 500 times each, so we have a wide range of results in order to get to a better conclusion.
-As we can see they are very related and similar, but with a significant efficiency difference when calculating the download time of a website's HTML code of a POST request, as shown in the graph:
-
-![urllib3 vs requests](https://raw.githubusercontent.com/alvarob96/investpy/0.6/statistic%20plots/urllib3-requests.png)
-
-If we analyse the graph, we can see that the mean time when sending a POST request is better when we use **requests** instead of **urllib3**, and it is also more stable and more consistent so on.
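For reference, the timing comparison described above could be reproduced with a minimal sketch along these lines; the endpoint URL is a placeholder rather than the one used in the study, and both clients are given connection pooling so the comparison stays fair:

```python
import time

import requests
import urllib3

URL = "https://httpbin.org/post"  # placeholder endpoint, not the one used in the original study
N = 500  # number of repetitions, matching the 500 runs mentioned above

session = requests.Session()  # requests client with connection pooling
http = urllib3.PoolManager()  # urllib3 client with connection pooling


def timed(call):
    # time a single POST request, in seconds
    start = time.perf_counter()
    call()
    return time.perf_counter() - start


requests_times = [timed(lambda: session.post(URL, data={"q": "test"}, timeout=5)) for _ in range(N)]
urllib3_times = [timed(lambda: http.request("POST", URL, fields={"q": "test"}, timeout=5.0)) for _ in range(N)]

print(f"requests mean: {sum(requests_times) / N:.4f}s")
print(f"urllib3 mean:  {sum(urllib3_times) / N:.4f}s")
```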

-Once we have the HTML code resulting as the response to the POST request, we need to scrap the data from it and insert it into a pandas.DataFrame, so we are looking for a fast HTML parsing tool that allows us to retrieve huge loads of data really fast, so the user of the package does not wait too much.
-The main Python packages used for HTML parsing are:
-* [bs4](https://pypi.org/project/beautifulsoup4/): Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
-* [lxml](https://pypi.org/project/lxml/): lxml is a Pythonic, mature binding for the libxml2 and libxslt libraries. It provides safe and convenient access to these libraries using the ElementTree API. It extends the ElementTree API significantly to offer support for XPath, RelaxNG, XML Schema, XSLT, C14N and much more.
-
-This unit tests are made with an stable Internet connection and done 500 times each, so we have a wide range of results in order to get to a better conclusion.
-To determine which has a better time performance, we are going to parse a HTML that contains historical data from the last 10 years, to see which package works better for huge loads of data, as shown in the graph:
+```python
+import investpy
+
+equity_profile = investpy.get_equity_company_profile(equity='bbva', source='Investing')
+```
+
+* **Retrieve Information Available on a Fund**: it consists of retrieving all the additional information indexed in Investing.com for a specified fund. The function checks that the fund is valid and then retrieves the information.
+This function has an optional parameter:
+    * **as_json**, if **True** it returns a JSON document with the information instead of a pandas.DataFrame; by default it is **False**.

-![bs4 vs lxml](https://raw.githubusercontent.com/alvarob96/investpy/0.6/statistic%20plots/bs4-lxml.png)
+```python
+import investpy
+
+fund_information = investpy.get_fund_information(fund='bbva multiactivo conservador pp', as_json=False)
+```
+
+* **Get a List of the Available Equities/Funds/ETFs Names**: this function returns a list containing all the available equities/funds/etfs from which you can retrieve data.

-We can clearly see that **lxml** completely outperforms **bs4**, with a much more better time result when retrieving huge loads of data from a HTML file; and it is more stable, with less fluctuations being more consistent.
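A similarly minimal sketch of the parsing comparison, assuming a local HTML dump of a historical-data page (the file name is hypothetical; any large, table-heavy page works):

```python
import time

from bs4 import BeautifulSoup
from lxml.html import fromstring

# hypothetical local dump of a historical-data page covering ~10 years
with open("historical_data.html", encoding="utf-8") as f:
    html = f.read()

# parse all table rows with bs4
start = time.perf_counter()
bs4_rows = BeautifulSoup(html, "html.parser").select("table tr")
bs4_time = time.perf_counter() - start

# parse the same rows with lxml via XPath
start = time.perf_counter()
lxml_rows = fromstring(html).xpath("//table//tr")
lxml_time = time.perf_counter() - start

print(f"bs4:  {bs4_time:.4f}s ({len(bs4_rows)} rows)")
print(f"lxml: {lxml_time:.4f}s ({len(lxml_rows)} rows)")
```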
+```python
+import investpy
+
+equities_list = investpy.get_equities_list()
+funds_list = investpy.get_funds_list()
+etfs_list = investpy.get_etfs_list()
+```

-To sum up, we can clearly determine that the best combination in this use case is to use **requests** to download the HTML code and process the POST/GET requests, while when parsing the HTML (data extraction) we determine that **lxml** completely outperforms any other Python HTML parser.
+Or you can manually check all the available equities/funds/etfs indexed in Investing.com:
+* Equities from the **Spanish Stock Market** -> https://es.investing.com/equities/spain
+* Funds from the **Spanish Stock Market** -> https://es.investing.com/funds/spain-funds
+* ETFs from the **Spanish Stock Market** -> https://es.investing.com/etfs/spain-etfs

-If you have any other package you want to compare with the ones used in this case, feel free to send me a mail to alvarob96@usal.es and I will try my best to answer fast.
+(**NOTE**: you will need an active Internet connection in order to get the scraper working.)

-## Future Work
+## Release Notes 0.8

-* Latest Spanish Stock News
-* Allow multiple input equities/funds/etfs
+* Remove bs4 for equity/fund/etf list retrieval
+* Company Profile Retrieval for All Equities
+* Fund Historical Data Date Error Fixed
+* Fund Overview Information Retrieval
+* Functions for Listing Equities, Funds and ETFs

## Additional Information

@@ -104,7 +107,7 @@ so I fix them as fast as I can. Also, any new ideas or proposals are welcome, an

For further information or any question feel free to contact me via email at alvarob96@usal.es

-You can also check my [Medium Publication](https://medium.com/research-studies-by-alvaro-bartolome/investpy-a-python-library-for-historical-data-extraction-from-the-spanish-stock-market-ad4d564dbfc5).
+You can also check my [Medium Publication](https://medium.com/research-studies-by-alvaro-bartolome/investpy-a-python-library-for-historical-data-extraction-from-the-spanish-stock-market-ad4d564dbfc5), where I upload weekly posts related to Data Science, some of which explain investpy's functions and development in more depth.

## Disclaimer

94 changes: 79 additions & 15 deletions investpy/__init__.py
@@ -280,10 +280,17 @@ def get_historical_data(equity, start, end, as_json=False, order='ascending'):
continue


-def get_equity_description(equity):
+def get_equity_company_profile(equity, source='Investing'):

+    available_sources = ['Investing', 'Bolsa de Madrid']

     if not equity:
-        raise ValueError("ERR#012: as_json argument can just be True or False, bool type."
+        raise ValueError("ERR#012: equity parameter is mandatory and must be a valid equity name."
                          "\n\t\t\tPlease check you are passing the parameters correctly or contact package admin: alvarob96@usal.es"
                          "\n\t\t\tIf needed, open an issue on: https://github.com/alvarob96/investpy/issues")

+    if source not in available_sources:
+        raise ValueError("ERR#013: the specified source is not valid, it can just be either " + ' or '.join(available_sources) +
+                         "\n\t\t\tPlease check you are passing the parameters correctly or contact package admin: alvarob96@usal.es"
+                         "\n\t\t\tIf needed, open an issue on: https://github.com/alvarob96/investpy/issues")

@@ -302,23 +309,42 @@ def get_equity_description(equity):

    for row in equities.itertuples():
        if row.name.lower() == equity.lower():
url = "http://www.bolsamadrid.es/esp/aspx/Empresas/FichaValor.aspx?ISIN=" + row.isin
if source is 'Bolsa de Madrid':
url = "http://www.bolsamadrid.es/esp/aspx/Empresas/FichaValor.aspx?ISIN=" + row.isin

headers = {
"User-Agent": ua.get_random(),
"X-Requested-With": "XMLHttpRequest"
}
headers = {
"User-Agent": ua.get_random(),
"X-Requested-With": "XMLHttpRequest"
}

req = requests.get(url, headers=headers, timeout=5)
req = requests.get(url, headers=headers, timeout=5)

root_ = fromstring(req.text)
root_ = fromstring(req.text)

path_ = root_.xpath(".//td[contains(@class, 'Perfil')]")
path_ = root_.xpath(".//td[contains(@class, 'Perfil')]")

if path_:
return path_[0].text_content()
else:
return None
if path_:
return path_[0].text_content()
else:
return None
elif source is 'Investing':
url = "https://www.investing.com/equities/" + row.tag + "-company-profile"

headers = {
"User-Agent": ua.get_random(),
"X-Requested-With": "XMLHttpRequest"
}

req = requests.get(url, headers=headers, timeout=5)

root_ = fromstring(req.text)

path_ = root_.xpath(".//*[@id=\"profile-fullStory-showhide\"]")

if path_:
return path_[0].text_content()
else:
return None
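For context, a minimal usage sketch of the new dual-source behaviour; the equity name is taken from the README examples and the output handling is illustrative:

```python
import investpy

# English profile, scraped from Investing.com (the default source)
profile_en = investpy.get_equity_company_profile(equity='bbva')

# Spanish profile, scraped from Bolsa de Madrid
profile_es = investpy.get_equity_company_profile(equity='bbva', source='Bolsa de Madrid')

# both calls return the profile text, or None when the profile cannot be located
print(profile_en[:100] if profile_en else 'profile not found')
```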


def get_funds_list():
@@ -931,4 +957,42 @@ def get_etf_historical_data(etf, start, end, as_json=False, order='ascending'):
"\n\t\t\tPlease check your Internet connection or contact package admin: alvarob96@usal.es"
"\n\t\t\tIf needed, open an issue on: https://github.com/alvarob96/investpy/issues")
else:
-            continue
+            continue


+# def get_available_sectors():
+#     sectors = []
+#
+#     resource_package = __name__
+#     resource_path = '/'.join(('resources', 'equities.csv'))
+#     if pkg_resources.resource_exists(resource_package, resource_path):
+#         equities = pd.read_csv(pkg_resources.resource_filename(resource_package, resource_path))
+#     else:
+#         names = ts.get_equity_names()
+#         equities = pd.DataFrame(names)
+#
+#     if equities is None:
+#         raise IOError("ERR#001: equities list not found or unable to retrieve."
+#                       "\n\t\t\tPlease check your Internet connection or contact package admin: alvarob96@usal.es"
+#                       "\n\t\t\tIf needed, open an issue on: https://github.com/alvarob96/investpy/issues")
+#
+#     for row in equities.itertuples():
+#         url = "https://www.investing.com/equities/" + row.tag + "-company-profile"
+#
+#         headers = {
+#             "User-Agent": ua.get_random(),
+#             "X-Requested-With": "XMLHttpRequest"
+#         }
+#
+#         req = requests.get(url, headers=headers, timeout=5)
+#
+#         root_ = fromstring(req.text)
+#
+#         path_ = root_.xpath("/html/body/div[5]/section/div[8]/div[2]/a")
+#
+#         if path_:
+#             sector = path_[0].text_content()
+#             if sector not in sectors:
+#                 sectors.append(path_[0].text_content())
+#
+#     return sectors
4 changes: 2 additions & 2 deletions setup.py
@@ -11,10 +11,10 @@ def readme():

setup(
    name='investpy',
-    version='0.7.6',
+    version='0.8',
    packages=find_packages(),
    url='',
-    download_url='https://github.com/alvarob96/investpy/archive/0.7.6.tar.gz',
+    download_url='https://github.com/alvarob96/investpy/archive/0.8.tar.gz',
    license='MIT License',
    author='Alvaro Bartolome',
    author_email='alvarob96@usal.es',
Binary file removed statistic plots/bs4-lxml.png
Binary file not shown.
Binary file removed statistic plots/urllib3-requests.png
Binary file not shown.
