Data Engineering at a glance!

Data engineering and live streaming analytics have been the need of the hour. The world today generates 2.5 quintillion bytes of data every day, deriving intelligence from those data has been the core priority for most of the users to perform actionable insights. We at TroonDx wanted to showcase a small demo of live streaming analytics using Spark for a New York Stock Exchange example, which provides a public API for us to perform the analytics.

Here’s a demo for you, completely simplified to showcase how it’s done. The methodology used is purely for a PoC, if you have a better way of doing this, let us know in the comments below.

Before getting into the content there are few terms which we should be familiarised with and those are as follows:

  • Symbol - Name of the stock.

  • Timestamp - Time at which we are getting the data.

  • open - The price at which the particular stock opens in the time period.

  • high - The Highest price of the stock during the time period.

  • low - Lowest price of the stock during the time period.

  • close - The price at which the particular stock closes in the time period.

  • volume - Indicates the total number of transactions involving the given stock in the time period.

Based on the data, some real-time analysis is needed to be done to generate insights that can be used to make informed decisions.

The data is retrieved with the use of API provided by Alpha vantage. The API returns the intraday time series(timestamp, open, high, low, close, volume) data specified in JSON format.

The API parameters are as follows:

  1. function - The time series of your choice. In this case, function=TIME_SERIES_INTRADAY.

  2. symbol - The name of the equity of your choice. For example, here symbol=MSFT (Microsoft).

  3. interval - The time interval between two consecutive data points in the time series. The following values are supported: 1min, 5min, 15min, 30min, 60min. Here, interval=1min

  4. apikey - your API key.

  5. outputsize - by default, outputsize=compact. Strings compact and full are accepted with the following specifications: compact returns only the latest 100 data points in the intraday time series; full returns the full-length intraday time series. This parameter is optional.

After hitting the API the data will be received. The API uses a demo API key. Own API key can be received using Alpha Vantage. The obtained key should be used in the python script along with the API to fetch the data.

The below picture represents the output of the python script after obtaining the data.

TroonDx Python Script

Here’s a glimpse of the sample data received -

Data Received After The Script Run

Once the NYSE opens, a script is run to fetch the data relating to the stocks every minute inside a particular folder. The script will make use of the API provided by the Alpha Vantage as described above. Alongside a Spark application runs to stream data from the folder every minute and performs the analyses on the data. The results of the analysis are to be written in an output file. These results will act as insights to make informed decisions related to the stocks.

The four major stocks involved are:

  • Facebook(FB)

  • Google(GOOGL)

  • Microsoft(MSFT)

  • Adobe(ADBE)

Data are fetched every minute relating to the above mentioned 4 stocks.

The above stocks are listed on NYSE (New York Stock Exchange) located in EDT timezone. Daily opening hours for NYSE are from 9:30 AM to 4:00 PM GMT (- 4) and it remains closed on Saturday and Sunday. At the time of running the script, we have written a logic to keep a note of the timezone and the opening hours to get the data and run the script during the time NYSE is open. I.e, It will remain open from 7.00 PM IST to 1.30 AM IST.

Read the data file generated inside the folder every minute and convert the data into DStreams in Spark.

We tried to bring out real-time insights using spark which will prove to be useful for the traders. The spark application was designed to produce 4 primary insights using its inbuilt DAG and DStreams library.

Here are a few steps that were followed -

1. The computation for a Simple moving average at the end cost of the four stocks is determined on a 5-minute sliding window throughout the previous 10 minutes. The end cost is utilized for the most part by the dealers and speculators as it mirrors the cost at which the market at long last settles down. The SMA (Simple Moving Average) is a parameter used to locate the normal stock cost over a specific period depending on a lot of parameters. The straightforward moving normal is determined by including a stock's costs over a specific period and isolating the entirety by the absolute number of periods. The straightforward moving normal can be utilized to distinguish purchasing and selling stocks.

Why: Simple Moving Average is used for forecasting constant demand for a symbol and to separate out the random variation. A rising moving average indicates that the symbol is in an uptrend, while a declining moving average indicates that it is in a downtrend.

2. Distinguishing the stocks which give the greatest profit(Average shutting cost - Average opening cost) out of each of the four is chosen in a 5 - minute sliding window throughout the previous 10 minutes.

Why: To identify the symbol which is giving maximum profit over a specific period of time and could be used to compare with other symbols.

3. The Relative Strength Index or RSI of the four stocks to be found in a 1 - minute sliding window throughout the previous 10 minutes. RSI is considered overbought when over 70 and oversold when beneath 30. The equation to ascertain the RSI is as per the following:

To disentangle the figuring clarification, RSI has been separated into its fundamental segments: RS, Average Gain and Average Loss. This RSI computation depends on 14 periods, which is the default recommended by Wilder in his book. Misfortunes are communicated as positive qualities, not negative qualities.

The absolute first figurings for normal addition and normal misfortune are basic 14-period midpoints.

First Average Gain = Sum of Gains in the course of the last 14 periods/14.

First Average Loss = Sum of Losses in the course of the last 14 periods/14

The second, and resulting, computations depend on the earlier midpoints and the present addition misfortune:

Normal Gain = [(previous Average Gain) x 13 + current Gain]/14.

Normal Loss = [(previous Average Loss) x 13 + current Loss]/14.

Taking the earlier incentive in addition to the present esteem is a smoothing procedure like that utilized in figuring an exponential moving normal. Here, for our situation, we will utilize 10-period midpoints

Why: Helps to identify when to buy or sell stock based on the RSI indicator. RSI indicates whether a stock is oversold or overbought. A trader might buy when the RSI crosses above the oversold line 30 and sell when the RSI crosses below the overbought line 70.

4. Computation of the exchanging volume of the four stocks at regular intervals and the choice of which stock to be obtained out of the four is to be finished. Volume assumes a significant job in the specialized investigation as it causes us to affirm patterns and examples. Volumes are a pointer of what number of stocks are purchased and sold over a given timeframe. Higher the volume, almost certain the stock will be purchased.

Why: It is important as it lets us keep track of the total number of stocks bought and sold for a particular period of time.

All these data are being fed into the elastic search and respective indexes are created accordingly.

Data Fed Into Elastic Search & Index Creation

The data pushed to elastic search could be discovered using the option discover in Kibana. Once added to Kibana, one can build different types of charts available in the system.

Here are some examples of the visual representations of the data from Kibana.

Visual Representation From Kibana

Here's an information flow diagram -

Information Flow

1. The Alpha vantage API returns a specific stock information from NYSE. The parameters of this API has been discussed before in this article.

2. A python script is being created to retrieve data from the New York Stock Exchange using API.

3. Analyses the data, the results will act as insights to make divisions.This analyses process has mainly four important calculations which are as follows:

A. To Find Moving Average

B. To Find Maximum Profit

C. To find RSI

D.Calculation of trading volume

Let's go step by step in explaining all the four main operations:

Before moving into these major analyses, Read Streaming JSON Data is performed by the spark.

Streaming DataFrames can be created through the data stream reader interface returned by



The topology diagram self explains the read and streaming a JSON file.

The topology diagram self explains the read and streaming a JSON file.

Topology For Read Streaming JSON File

Topology diagram self-explaining about both Maximum profit and moving average

Topology For Simple Moving Average & Max.Profit

A. To Find Moving Average:

groupBy() and window() operations are used to calculate windowed aggregation.

withWatermark() operation used to define the watermark of a query by specifying the event

time column and the threshold on how late the data is expected to be in terms of event time.

Closing price is used to calculate the simple moving average for stocks in a 5-minute sliding

the window for the last 10 minutes.

The below chart represents the visual representation of the analysed data for “Moving Average”.

Visual Representation of Moving Average

B.Calculate Maximum Profit:

Maximum profit is calculated by subtracting Average Opening from Average Closing during the operation performed for calculation simple moving average.

The below chart represents the visual representation for Maximum Profit from the analysed data.

Visual Representation For Maximum Profit

C. To FInd RSI(Relative Strength Index)

The below figure is the topology representation of RSI and Total Trading Volume.

Topology Representation For Relative Strength Index & Total Trading Volume

RSI is considered overbought when above 70 and oversold when below 30. The formula to calculate the RSI is as follows:

RS = AverageGain / AverageLoss

RSI = 100 - 100


1 + RS

RSI esteem is determined in a client characterized capacity dependent on the recipe gave and brought in the flash SQL.

Gain is determined by subtracting the opening cost from the end cost when the end cost is more prominent than opening and Loss is determined by subtracting shutting cost from the opening cost when the opening is more prominent than misfortune.

Normal Gain and Average Loss are determined with the information Gain and Loss as above in a 1-minute sliding window throughout the previous 10 minutes. Normal Gain and Average Loss are passed as parameters to a client characterized capacity to compute RSI.

The below chart represents the visual representation of RSI from the analysed data.

Visual Representation For RSI

D.Calculation of Trading volume

Sum() activity is utilized to compute the exchanging volume of the four stocks at regular intervals and choose which stock to buy out of the four stocks.

This below chart represents the visual representation of Total Trading Volume from the analysed data.

Visual Representation For Total Trading Volume

Here’s a glimpse of what we learnt -

This blog mainly helps us provide insights into the current trends of the NYSE ( New York Stock Exchange.

  1. The collection of data from the NYSE is achieved in the format of JSON file using a python script through API

  2. This script retrieves the data continuously for particular stocks (Facebook,Microsoft,Adobe and Google) for a regular interval of time

  3. These JSON files are retrieved by Spark Application in streaming fashion for calculation of Simple Moving Average, Relative Strength Index, Total Trading Volume, Maximum Trading Volume and transformation as required for indexing

  4. The transformed values are indexed accordingly and fed into the Elastic search. These data are viewed in Kibana with help of different types of charts and dashboards

  5. The dashboard is used to view all the charts in a single frame.

Hope you found this useful!

Credits -

Ganesh Kumar - Solutions Engineer, TroonDx

Santosh A.M - Data & Blockchain Intern, TroonDx

233 views0 comments


About Us




Contact Us


OAS Towers, Stringers Street,

Vepery, Chennai -600003

©2019 by TroonDx. - Global Blockchain Development Company  All Rights Reserved.