SOGETI UK BLOG

In the first part, some strange conclusions appeared to be drawn from the data. Let's try to explain what happened.

Folding/Aliasing

We analysed a sequence of data (people entering a mall, measured every 7 minutes during the day for a month) that we assumed to be periodic. So we “converted” our data from time to frequency to identify the rush periods (and we “converted” it back to check how close we were).

What we forgot in the process was that our sampling rate (fs = once every 7 minutes, i.e. 2.38 mHz; as it is "slow" we have to use millihertz units) didn't permit us to "find" anything with a frequency higher than fs/2 (1.19 mHz in our case, i.e. anything happening more than once every 14 minutes). A key aspect of converting information from the time domain to the frequency domain is that you need at least two samples per cycle of a frequency to be able to measure it.

"But we didn't do anything wrong!" one could argue, as we found 45 minutes (0.37 mHz) and 17 minutes (0.99 mHz), which are both below fs/2 (1.19 mHz). But on the other hand, we didn't "clean" the data before applying our transformation tool, and there was evidently something hiding in it (the main subway line, up to 10 people every 12 minutes, i.e. 1.39 mHz). The 45 minutes was real (it is the train interval), but the 17 minutes wasn't at all: it appeared, like a ghost, mirrored around the Nyquist frequency (fs/2) due to "folding" (like folding a sheet of paper along a line at fs/2).
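As a minimal sketch of that folding (Python/NumPy, not the original analysis tool), sampling a pure 12-minute component at the 7-minute rate makes the peak reappear near 17 minutes:

    # Illustrative sketch: a 12-minute arrival pattern sampled every 7 minutes
    # shows up near 17 minutes after folding at fs/2.
    import numpy as np

    fs = 1 / (7 * 60)        # sampling rate: one measure every 7 minutes (~2.38 mHz)
    f_true = 1 / (12 * 60)   # real subway component: every 12 minutes (~1.39 mHz)

    t = np.arange(0, 30 * 24 * 3600, 7 * 60)       # one month, sampled every 7 minutes
    samples = 10 * np.cos(2 * np.pi * f_true * t)  # pure 12-minute component

    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=7 * 60)

    peak = freqs[np.argmax(spectrum[1:]) + 1]      # skip the DC bin
    print(f"apparent peak: {peak * 1000:.2f} mHz ~ every {1 / peak / 60:.1f} minutes")
    print(f"folding predicts fs - f_true = {(fs - f_true) * 1000:.2f} mHz")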

This folding is the first reason to always be careful when looking for frequencies in measurements of phenomena that can vary faster than half the sampling rate. One way to avoid it is to increase the sampling frequency; here is the same mall example with a one-minute sampling rate (still showing the estimation for comparison):

Fig 1: higher sampling rate

But that's not always possible when you don't know how "high" in frequency reality can go, so applying a low-pass filter before sampling is usually the way to go, remembering that low-pass filters are not "vertical" and induce artifacts of their own (the stronger the filter, the more the phase is altered; some residue often remains when there is strong high-frequency content, and it should be drowned out by adding noise, i.e. dither, beforehand).
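As an illustration of that anti-aliasing step (assuming a higher-rate measurement is available to start from), here is a minimal Python/SciPy sketch: a Butterworth low-pass applied forward and backward (to avoid the phase shift mentioned above) before keeping only one sample every 7 minutes.

    # Illustrative sketch: low-pass a 1-minute series below the new Nyquist
    # frequency before downsampling it to the 7-minute grid.
    import numpy as np
    from scipy import signal

    minute = 60.0
    fs_high = 1 / minute              # assumed available: one measure per minute
    fs_low = 1 / (7 * minute)         # target rate: one sample every 7 minutes
    nyquist_low = fs_low / 2

    t = np.arange(0, 6 * 3600, minute)   # 6 hours at one sample per minute
    x = (10 * np.cos(2 * np.pi * t / (12 * minute))
         + 5 * np.cos(2 * np.pi * t / (45 * minute)))

    # 4th-order Butterworth low-pass, cut below the new Nyquist frequency;
    # filtfilt runs it forward and backward, so the phase is not shifted.
    b, a = signal.butter(4, 0.8 * nyquist_low / (fs_high / 2))
    x_filtered = signal.filtfilt(b, a, x)

    x_decimated = x_filtered[::7]     # safe(r) to keep one sample every 7 minutes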

In more everyday experience, when you see a wheel rotating backwards in a movie, or when an old CD mastering sounds harsh, you now know where it comes from (aliasing for the wheel; steep low-pass filters with little or no noise added beforehand, plus sometimes folding residue, for the CD).

Regression to the mean

As for the strange performance improvement (and deterioration) observed when comparing the best- and worst-performing groups of services across two successive test campaigns (when in fact we discovered nothing had been done between the two campaigns to improve or worsen their performance), I'll let Derek Muller from Veritasium give you some explanation and examples:

https://www.youtube.com/watch?v=1tSqSMOyNFE

The main point here is to always be careful when comparing a selected subset of measures (because the reason for the selection can itself affect the comparison), to always check the population of measures (is it already a selected subset?) and the effect of the measurement on the data (am I measuring the tool or the phenomenon?). Another good practice is to keep an open mind and evaluate other hypotheses as well (because our main line of analysis often converges on the result we would like to observe).
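A minimal simulation makes the effect visible. Assuming each service has a stable "true" response time plus independent measurement noise in each campaign, nothing changes between the two runs, yet the group selected as "worst" improves and the group selected as "best" deteriorates:

    # Illustrative simulation: no real change between two campaigns, yet the
    # selected extremes appear to move, purely because of how they were selected.
    import numpy as np

    rng = np.random.default_rng(42)
    true_time = rng.lognormal(mean=5.0, sigma=0.3, size=1000)    # stable times (ms)
    campaign1 = true_time * rng.lognormal(0.0, 0.15, size=1000)  # noisy measure, run 1
    campaign2 = true_time * rng.lognormal(0.0, 0.15, size=1000)  # noisy measure, run 2

    worst = np.argsort(campaign1)[-100:]   # the 100 slowest in campaign 1
    best = np.argsort(campaign1)[:100]     # the 100 fastest in campaign 1

    print("worst 100:", campaign1[worst].mean(), "->", campaign2[worst].mean())
    print("best 100: ", campaign1[best].mean(), "->", campaign2[best].mean())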

Or you can use this in future career moves…

Conclusion

With data becoming easier to acquire, tools easier to use, and quantities large enough to overwhelm us, use the tools you know well, keep an open mind about your hypotheses, and always ask a specialist when dealing with complex analyses and high-stakes results.

To read the original post and add comments, please visit the SogetiLabs blog: THIS PIECE OF DATA IS LYING! (2/2)


 

 

AUTHOR: Claude Bamberger
Claude Bamberger has been an information systems architect since his first job in 1994. Claude's foray into innovation has been recognized twice by ANVAR (now OSEO), for a school project that grew into a lab and for a collaborative Talent Management solution, and also by the EEC for a custom TV channel for high-speed Internet providers that has been deployed in multiple countries from Asia to Europe.

Posted in: A testers viewpoint, Big data, Business Intelligence, communication, Developers, Opinion      

 


 

Data analysis is fascinating: with a good source of data and appropriate tools (both of which are becoming more and more accessible these days), you can arguably see more clearly, explain what is happening, and even predict the future.

In this first part, we will walk through two cases where data appears to be lying.

First case: The mall and the subway

In a mall, we see people coming and going all day long, but couldn't we predict the crowd a little better to adjust our sales events?

Let’s measure people entering the mall every 7 minutes:

 

Fig 1: number of people entering the mall during a 6-hour extract, one blue point every 7 minutes

Based on this data (in fact, based on one month of such data) and the use of the "Power Spectral Density Estimator" tool in the new version of our data analysis system, we were able to identify the frequencies at which larger groups of people come into the mall!

We find two main frequencies, corresponding to periods of 45 minutes and nearly 17 minutes, which, used in a simulation, correlate quite well with the measurement.
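As a rough sketch of what such an estimator does (the file name is made up and the counts are assumed to be loaded into an array), a Welch power spectral density estimate in Python/SciPy would look like this:

    # Illustrative sketch of a power spectral density estimate on the mall counts
    # (assumes one month of entries measured every 7 minutes).
    import numpy as np
    from scipy import signal

    fs = 1 / (7 * 60)                         # sampling rate in Hz (~2.38 mHz)
    counts = np.loadtxt("mall_counts.txt")    # hypothetical export of the measures

    f, psd = signal.welch(counts - counts.mean(), fs=fs, nperseg=1024)

    # The strongest peaks point at the dominant periods, e.g. ~0.37 mHz (45 minutes).
    top = f[1:][np.argsort(psd[1:])[::-1][:3]]
    for freq in top:
        print(f"{freq * 1000:.2f} mHz ~ every {1 / freq / 60:.1f} minutes")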

 

Fig 2: same as figure 1, with the red curve showing the estimation

In conclusion, the 45-minute frequency is logical, since the main train runs every 45 minutes during the day; but figuring out why there is a 17-minute frequency is difficult, as none of the subway schedules indicate this kind of timing.

Second case: The test

To improve our software performance, a complete "live" measurement campaign was conducted on our services layer, establishing the most comprehensive test to date: the response times of one thousand services in real conditions.

Fig 3: services performance, first measurement campaign
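As a purely hypothetical sketch of how such a campaign could be collected (the endpoints are made up), each service is simply called and timed, and the times are kept for analysis:

    # Illustrative sketch: time one call per service and keep the results (ms).
    import time
    import requests

    service_urls = [f"https://services.example.com/api/service{i}" for i in range(1000)]

    response_times = {}
    for url in service_urls:
        start = time.perf_counter()
        requests.get(url, timeout=30)
        response_times[url] = (time.perf_counter() - start) * 1000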

The network team feels it could improve the results by prioritising the best-performing services (which are also the most often used ones, as their code is already quite optimised) using QoS on the network.

In my team, we believe code reviews are the way forward. We think we can improve the results by reviewing the services and giving some advice to the development team, so we take the 100 "worst"-performing services on the list and begin our work.

A month later, a new campaign is performed; we observe the same kind of measurement (globally), with a comparable mean and standard deviation.

And the results are…

 

Fig 4: impact of code reviews on the 100 worst-performing services

Very good indeed! As you can see, the improvement (data moving "left") is 10-50% in each rank, and some services have improved by at least 40%.

Well, the results are not so good for some friends of ours…

 

Fig 5: impact of QoS on the 100 best-performing services

The "best-performing" services are now even worse (data moving "right"), with some nearly doubling their response time; most ranks are worse (the first two are a little better, but it's not worth it).
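As a sketch of how these comparisons are made (the file names are hypothetical): select the 100 worst or best services by their first-campaign times, then look at the very same services in the second campaign.

    # Illustrative sketch: compare the same selected services across two campaigns.
    import numpy as np

    campaign1 = np.loadtxt("campaign1_times.txt")   # hypothetical export, 1000 values (ms)
    campaign2 = np.loadtxt("campaign2_times.txt")

    order = np.argsort(campaign1)
    worst100, best100 = order[-100:], order[:100]

    for label, idx in (("worst 100", worst100), ("best 100", best100)):
        change = (campaign2[idx].mean() - campaign1[idx].mean()) / campaign1[idx].mean()
        print(f"{label}: mean {campaign1[idx].mean():.0f} -> "
              f"{campaign2[idx].mean():.0f} ms ({change:+.0%})")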

Based on the first set of data, we can conclude that the network was already very well set up and should be set back to its previous settings, and we can schedule code reviews for the next 100 "worst-performing" services to evaluate the ROI more closely before generalising this approach to every service used in critical applications.

But talking with our colleagues, we realised two very strange things:

  • the network team ended up not setting up the priorities, as the network monitoring tools showed a very fluid network;
  • the development team was swamped with a new mobile app to build and integrate, and couldn't implement our recommendations yet.

So no one did anything, but nevertheless the results changed dramatically. I can't figure out why they changed so much. Chance, perhaps? Can you figure it out?

Interlude

Those two stories are simplified illustrations of things that can go wrong (or suspiciously well) when using data.

In the next part, we will look more closely at the tools we used and how these results can be explained.

To read the original post and add comments, please visit the SogetiLabs blog: THIS PIECE OF DATA IS LYING! (1/2)


 

 

 


Posted in: Big data, Business Intelligence, Developers, IT strategy, Performance testing, Test Automation, test data management      

 

How can we tackle performance during the design, build and run phases of a system, to handle issues before they happen?

We have all been confronted with critical systems not performing as they should. This subject is fascinating, as it crosses nearly all the fields of I.T. system design, build and run.

Defining performance: How should it perform?

"Not performing as it should" is our first clue. How should it perform? Has that been thoughtfully defined? As well defined as the entry fields' colors, the business rules, or this shiny button on the left?

Analysts often have difficulties with non-functional requirements, and sometimes even avoid defining them. Yet these are critical requirements, especially performance: it is expensive to meet when set excessively, and expensive not to meet when the user's experience suffers, whether in sales, audience and company image for a public website, or in employee motivation, engagement and process efficiency for an in-house system. Only business drivers can serve as the foundation for defining NFRs; by the time of technical design, it is often too late to go back and acquire them.

IMHO, agile approaches have taught us very interesting techniques for expressing user experience, and so has the growing field of business service level definition. Focusing on the interaction between each user and the system is key to defining the experience and the main points of attention. Service level values on these main points complete the goal definition (see the sketch after this list):

  • Key interaction scenario
  • Value of performance
  • Cost of non-performance
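As a minimal sketch (all names and numbers are made up), such a goal definition can be captured per key interaction scenario in a simple, reviewable structure:

    # Illustrative sketch: one performance goal per key interaction scenario.
    from dataclasses import dataclass

    @dataclass
    class PerformanceGoal:
        scenario: str          # key interaction scenario
        target_p95_ms: int     # value of performance: 95th-percentile target
        cost_of_miss: str      # cost of non-performance, in business terms

    goals = [
        PerformanceGoal("search and display a product page", 800,
                        "abandoned visits, lost sales"),
        PerformanceGoal("submit a leave request", 2000,
                        "employee frustration, delayed HR process"),
    ]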

If a system specification can provide such inputs, software and technical designers/architects will have a very good starting point to work from and to communicate on the choices they make.

Using a model

An I.T. system is known to be complex because it is made of many moving parts, each living a predictable but non-linear life.

As engineers, we try to tackle that by modeling the system, grouping parts and subsystems into larger subsystems (engineers love nesting dolls). The system is now simpler to describe: fewer moving parts, a hierarchical decomposition into understandable pieces. But each part is now even less linear and less predictable. Or is it?

In my experience, at a certain level, subsystems can be bound to a certain behavior (not purely linear, but following simple, well-defined models), and when the behavior is not completely bound, design decisions can be made at the few remaining "unstable" points to ensure the correct behavior of the system. At this level, indicators can be expressed in conjunction with the interaction scenarios.

The issue here is usually to build the scenario (really implement the user side of it) and to have enough experience to target the right level. This calls for an experienced architect with some background in testing. Here again, agile approaches have taught us and given us a lot on defining scenarios, implementing their user side, and setting them up as continuous integration test cases.
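As a minimal sketch of the user side of one such scenario, runnable as a continuous integration test case (the URL and threshold are hypothetical):

    # Illustrative sketch: a key scenario expressed as a CI-runnable test.
    import time
    import requests

    def test_search_scenario_under_one_second():
        start = time.perf_counter()
        response = requests.get("https://shop.example.com/search",
                                params={"q": "shoes"}, timeout=5)
        elapsed_ms = (time.perf_counter() - start) * 1000

        assert response.status_code == 200
        assert elapsed_ms < 1000, f"search took {elapsed_ms:.0f} ms, goal is 1000 ms"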

By modeling the various subsystems and their interactions, we can estimate the flow of events and place the resource constraints that shape it from linear to non-linear to pure overflow. The model itself can already demonstrate limits by applying the performance goals under the load defined by the composition of the scenarios (occurring concurrently), the common behavior of technical subsystems, and the SLAs of contributing subsystems. This can be very useful to pinpoint structural bottlenecks and to concentrate on what to test.
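As a minimal sketch of such a model (treating one subsystem as a simple M/M/1 queue, with made-up numbers), the non-linearity shows up as soon as utilisation gets close to 1:

    # Illustrative sketch: one subsystem modelled as an M/M/1 queue to see where
    # response time stops growing linearly with the load.
    service_time_ms = 50.0                      # mean time needed per request
    capacity_rps = 1000.0 / service_time_ms     # ~20 requests/second at saturation

    for load_rps in (5, 10, 15, 18, 19.5):
        utilisation = load_rps / capacity_rps
        response_ms = service_time_ms / (1.0 - utilisation)   # M/M/1 mean response
        print(f"{load_rps:>5} req/s -> utilisation {utilisation:.0%}, "
              f"mean response ~{response_ms:.0f} ms")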

Testing, modeling and new perspective

In this first part, we explored why defining performance goals and their context early in the project is key, and how modeling during the design phases can help tackle issues before they arise.

In the next part of this article, we'll dig a bit deeper into testing, and into what to do when performance is such a major requirement and feature of the system that it has to be built into it.

In the meantime, don't forget to read Patrick's posts on defining complexity, especially the third chapter on reflexivity, which will be very useful (different subjects, same approach: that's the versatility of CS).

 


Posted in: Agile, Performance testing, SogetiLabs, User Experience      

 

Enterprise IT is filled with a plethora of information; nevertheless, it is the fastest-growing element these days, and projects aim at increasing it every day.

The Big Data buzz keeps telling IT managers to be ready to increase the information volume. The Open Data community keeps telling IT managers, especially those at public institutions, to be ready to feed the world with invaluable data.

But there are not a lot of answers about what the data is and how to "open" it (access it and make it accessible). There are neat, and often expensive, tools and approaches to store data, analyze data, and make data flow.

One main issue with the "what" aspect is that the data is generally not entirely available in the enterprise IT system, and even when it is, it's not that accessible for every kind of work. Some of it is freely available on the Internet; sometimes even the enterprise's own data is out there, but scattered and unstructured across web pages.

The encounter

I recently met with quite a promising startup that aims at contributing to both the "what" and the "how": import•io. Their motto: "Transform information from the web into useable data".

They have built a platform that transforms websites into tables of data, making them as accessible as a spreadsheet or an API. A very clever aspect of this platform is that it can learn to read web pages simply by surfing with a browser and highlighting the interesting data on the pages. The platform then does the heavy lifting, aggregating data from different pages into always up-to-date datasets, ready to be used or accessed through APIs.
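As a purely illustrative sketch (this is not import•io's actual API; the endpoint and columns are made up), consuming such a dataset as a table only takes a few lines:

    # Illustrative sketch: read a dataset exposed as JSON rows and use it as a table.
    import pandas as pd
    import requests

    rows = requests.get("https://data.example.com/datasets/store-prices.json",
                        timeout=30).json()
    table = pd.DataFrame(rows)

    print(table.head())
    print(table.groupby("store")["price"].mean())   # columns assumed for illustration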

Mixing

One interesting aspect of such a solution is its ability to mix information sources as the project goes, rather than initially building a huge, definitive platform just for established external data. Sources can grow in number or in volume with nearly no limits (some users already have thousands of sources; some data tables are small, some contain thousands of rows), and of course the project can still assemble in-house data internally as usual. As the number of sources grows, the economics of the solution keep improving, because each source is quite light to build, as with most SaaS, instead of being an upfront investment. Moreover, the marginal cost of setting up a new source is reduced by the simplicity of use and its accessibility to less experienced profiles.

New perspectives

One aspect that made me reconsider some of my convictions is the speed and simplicity with which it addresses something that is usually not that simple. Building an API on your own data is so simple and fast with the import•io browser that it could easily be done better with such a platform than with in-house development.

I've had the chance more than once to design the architecture for a mobile version of a website, and it often ended up with services being shared or developed and new facades being built. I must say that today I would spend some time asking myself about the right mix between building in-house services and simply setting up new datasets.

This new kind of approach reminds me of the feeling I had when discovering some hosting solutions that widely shook up the way we saw website management.

Another aspect that reinforced my convictions concerns the company that built this platform. It's a 21st-century company, building an impressive and ambitious business from scratch, and fast. The platform is built and monitored on top of other platforms and services, accessed and integrated directly over the web. They add users exponentially every month, and it makes them stronger, not scared; it makes the platform grow, not stall; it makes the tools more effective, not overwhelmed. Everything is designed to grow like water lilies, the web way. And to do that, using SaaS/IaaS solutions assembled over the Internet is a promising path.


Posted in: Big data, Business Intelligence, Mobility, Open Data, Uncategorized      