By Julien Delange, Member of the Technical Staff, Software Solutions Division
Introducing new software languages, tools, and methods in industrial and production environments incurs a number of challenges. Among other necessary changes, practices must be updated, and engineers must learn new methods and tools. These updates incur additional costs, so transitioning to a new technology must be carefully evaluated and discussed. Also, the impact and associated costs for introducing a new technology vary significantly by type of project, team size, engineers’ backgrounds, and other factors, so that it is hard to estimate the real acquisition costs. A previous post in our ongoing series on the Architecture Analysis and Design Language (AADL) described the use of AADL in research projects (such as System Architectural Virtual Integration (SAVI)) in which experienced researchers explored the language capabilities to capture and analyze safety-critical systems from different perspectives. These successful projects have demonstrated the accuracy of AADL as a modeling notation. This blog post presents research conducted independently of the SEI that aims to evaluate the safety concerns of several unmanned aerial vehicle (UAV) systems using AADL and the SEI safety analysis tools implemented in OSATE.
At the April 2014 AADL standardization committee meeting in Santa Barbara, CA, Jerome Hugues, a professor at the Institut Supérieur de l’Aéronautique et de l’Espace (ISAE) in Toulouse, France, presented a project that aims to evaluate the safety concerns of several UAV systems using AADL and the SEI safety analysis tools implemented in OSATE. This study’s results were quite a surprise for us: the study was done independently, without any support from the SEI research team, while the safety analysis tools were still under development. More impressive was that the team acquired the basics of AADL and its associated error notation very quickly, created the models for different systems, and generated safety validation materials automatically from the models within a month.
This experiment shows how quick and easy it is to learn AADL and use its associated tools to analyze a system. I wanted to learn more about this project, and Hugues gave me the opportunity to ask him a few questions:
Can you introduce yourself? Please tell us a little bit about where you work and how your organization uses AADL. When and how were you introduced to AADL?
I am an associate professor at ISAE, the reference engineering school in aeronautics and space engineering in France. Since 2010, I have served as co-chair of the Advanced Master on Embedded Systems (MS EMS) program at École Nationale Supérieure d'Électronique, d'Électrotechnique, d'Informatique, d'Hydraulique, et de Télécommunications (ENSEEIHT) and ISAE. I specialize in the teaching of software architecture. My research focuses on embedded systems architecture, with a strong focus on the AADL. I have been a member of the AADL standardization committee since 2006, and a member of its steering committee since 2011.
I was first introduced to AADL as part of the IST ASSERT project. I remember quite well my first encounter with Bruce Lewis, our AADL "Sherpa" during the Ada-Europe Conference on Reliable Software Technologies in Palma de Mallorca: it was 10 years ago, in June 2004. At that time, we were looking for an architecture design language with clear semantics and a textual representation, all of which were part of the initial requirements of the Society of Automotive Engineers (SAE) AADL standard.
Can you introduce/explain your project, its context, and objectives? What challenges did you want to address, and why did you select AADL as the modeling language?
As part of my research, I evolved from modeling patterns to modeling distributed, embedded systems and code generation to system engineering concerns. I am interested in a pragmatic solution to define stringent, tool-supported methodologies for the engineering of critical systems. Achieving this solution implies code generation that meets space industry standards as part of the TASTE project, scheduling analysis, and model checking. More recently, I became interested in defining a better coupling between analysis tools and models. Using AADL, as well as being part of the group defining it, provides a perfect context in which to discuss all those concerns and define practical solutions.
Figure 1. One of the UAV Systems Used by the ISAE Team
Please explain for us how you used AADL in this project. If possible, please give us the nuts-and-bolts details.
Recently, the AADL committee started an effort to support reliability analysis as part of modeling and verification activities supported by the language. In the meantime, the SEI developed various tools to support those analyses in the scope of the ARP4761 standard. I became curious about how far we could go with these tools. As head of a master’s degree program at ISAE, I have the chance to teach students with strong backgrounds not only in safety but also in aeronautical engineering. I thus proposed a challenge to my students: model one family of four UAVs that we develop at ISAE using AADL, and perform safety analysis to evaluate the reliability metrics of our UAVs. It started as an exploratory project, yet the results went beyond our expectations. Not only could we abstract the key elements of the UAVs; we could also perform a wide range of analyses using a single model, as opposed to using separate tools.
Being an aeronautics school, ISAE is developing prototype UAVs for both research and teaching purposes. These UAVs range from quadrotors (two variants) to fixed-wing planes. They share a common embedded part, on top of which various command laws, sensors, and actuators are deployed. Our modeling objective was to capture product-line concerns in the form of a library of components, and then to model platform-specific configurations as specific instances of these building blocks. This first step demonstrated that AADL could capture many of the elements considered when engineering UAVs.
The second step was to model faults and their propagation in the architecture. To do so, we defined problematic situations in our systems. Examples of these problematic situations include bad errors, loss of a component, transient errors in communication, etc.
The final step was to perform safety analysis. Thanks to the SEI, our team of researchers could directly apply the analysis plug-ins bundled with OSATE2. We performed all kinds of analyses required by the aeronautic industry, including functional hazard assessments, failure mode and effects analysis, fault tree analysis, and more.
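To make the fault tree step more concrete, here is a minimal sketch of the kind of calculation a fault tree analysis performs. It is not the OSATE or OpenFTA implementation. The event names (`autopilot_failure`, `gps_loss`, `imu_loss`) and their probabilities are hypothetical, and the sketch assumes that basic events are statistically independent and that no event appears in more than one branch.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class BasicEvent:
    name: str
    probability: float  # hypothetical failure probability, not ISAE data


@dataclass
class Gate:
    kind: str  # "AND" or "OR"
    children: List[Union["Gate", BasicEvent]]


def probability(node: Union[Gate, BasicEvent]) -> float:
    """Probability of the event at `node`, assuming independent basic events."""
    if isinstance(node, BasicEvent):
        return node.probability
    child_probs = [probability(child) for child in node.children]
    if node.kind == "AND":
        # All children must fail: multiply their probabilities.
        result = 1.0
        for p in child_probs:
            result *= p
        return result
    if node.kind == "OR":
        # At least one child fails: 1 minus the chance that none fail.
        none_fail = 1.0
        for p in child_probs:
            none_fail *= 1.0 - p
        return 1.0 - none_fail
    raise ValueError(f"unknown gate kind: {node.kind}")


# Hypothetical top event: control is lost if the autopilot fails,
# or if both position sources are lost at the same time.
loss_of_control = Gate("OR", [
    BasicEvent("autopilot_failure", 1e-4),
    Gate("AND", [
        BasicEvent("gps_loss", 1e-3),
        BasicEvent("imu_loss", 1e-3),
    ]),
])

print(f"P(loss of control) = {probability(loss_of_control):.3e}")
```

In a model-based workflow, the tree itself would be derived from the error model rather than written by hand, which is the benefit the interview describes.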
Figure 2. Overview of the Architecture Model Designed by Dr. Hugues and his Team
In addition to yourself, who else is involved in the project, and what are their backgrounds? How did they learn AADL? Have they faced any specific issues or problems?
The team was made up of three students with backgrounds in telecommunications (Nicolas Chatonnay, Julien Bosseboeuf, and Jérôme Pierra, all MS EMS students), two lab engineers from the ISAE/DMIA engineering team (Jacques Lamaison and Alain Hostalier), and me. I was the only person with a solid background in AADL and related model-based technologies. My students took the 20-hour AADL class that I teach. This teaching was combined with the availability of the book Model-Based Engineering with AADL by Peter Feiler and David Gluch.
The modeling activity was straightforward thanks to the definition provided in the Error Model Annex defined by the AADL committee. This straightforwardness is the big strength of AADL: the language and its annexes are defined using the engineer’s words, not through specific modeling-language verbiage. Project meetings were more focused on "what do we want to model?" rather than "how do we model such a pattern?" This focus is definitely a good point when considering a specific modeling notation.
What tools did you use to model and analyze the system?
We relied on OSATE 2.0.5 and the OpenFTA toolset. OSATE maps error models to OpenFTA for further analysis.
Figure 3. Generating the Fault Tree from the AADL Model
How much time was required to learn AADL basics and error model notation and tools? How long did it take to resolve the complete problem from start to finish?
The full project—including learning EMV2, mastering tools, modeling, and finally, analyzing—was done in four weeks. This time span included writing reports detailing all elements. The models represent 2,000 source lines of code of AADL, covering all components, interconnections, and associated error models.
What feedback have you received from this project? What is your takeaway?
Students were quite impressed to achieve all the objectives in such a short time frame. Collaborative modeling, the analysis plug-ins, and the expressive power of the language are sufficient to conduct safety analysis for UAVs. The ability to achieve these objectives constitutes a good case study for those interested in AADL. We plan to publish this case study along with accompanying notes in the near future.
Do you have plans for future work on this project? If yes, do those future plans involve AADL?
Now that we have demonstrated safety figures for the UAV, the next big step at ISAE is to start from these models and move toward code generation. ISAE specializes in control systems, and we have design tools for code generation from AADL models and could also integrate Simulink models. It would be great to first regenerate code that was previously handwritten, and then ascertain that the quality of the generated code does not hinder safety. We have an ongoing project to apply theorem provers to generated code.
Wrapping Up and Looking Ahead
This experience report shows the ease with which AADL can be used to model a family of systems and apply safety-analysis techniques. ISAE’s experience with its UAVs also demonstrates that AADL analysis tools are relatively straightforward to use and do not require extensive experience in model-based engineering. As Hugues reports, learning the language does not seem to be an issue: the team learned the language basics and obtained significant results within a few weeks. The ease with which Hugues’ team learned the language demonstrates that engineers with appropriate backgrounds can quickly acquire and apply the technology. And because "the language and its annexes are defined using the engineer’s words," as Hugues reports, it is easy to express and apply engineering concepts in AADL. For that reason, users were able to capture systems concepts with the language and focus on their engineering domain (i.e., safety).
This study focuses mostly on safety aspects, but it could also apply to other engineering domains, such as security and performance. AADL is an extensible language, so users can tailor it to capture the quality attributes that they would like to analyze.
We welcome your feedback below.
Additional Resources
For more information about AADL, please visit http://www.aadl.info/aadl/currentsite/ and our wiki at https://wiki.sei.cmu.edu/aadl/.
SEI Blog, Jul 27, 2015 02:08pm
By Ian Gorton, Senior Member of the Technical Staff, Software Solutions Division
In earlier posts on big data, I have written about how long-held design approaches for software systems simply don’t work as we build larger, scalable big data systems. Examples of design factors that must be addressed for success at scale include the need to handle the ever-present failures that occur at scale, assure the necessary levels of availability and responsiveness, and devise optimizations that drive down costs. Of course, the required application functionality and engineering constraints, such as schedule and budgets, directly impact the manner in which these factors manifest themselves in any specific big data system. In this post, the latest in my ongoing series on big data, I step back from specifics and describe four general principles that hold for any scalable, big data system. These principles can help architects continually validate major design decisions across development iterations, and hence provide a guide through the complex collection of design trade-offs all big data systems require.
First Principle: System Costs Must Grow More Slowly Than System Capacity
Imagine you’re asked to build a system that will initially manage and analyze 1 petabyte (PB) of data, and the budget to build and operate this system in the first year is $2 million. In addition, data and processing are expected to double in size every year, and performance and availability requirements are expected to remain stable. This growth pattern means that in four years, your system will be managing 16PB of data, and in six years, 64PB. For many development teams, this growth rate would be a daunting requirement.
Now imagine that your design can sustain this growth, but the associated costs will also double each year, which means that in four years the project will require a budget for development and operations of $32 million. That’s probably not an estimate many clients will like. What is reasonable will of course depend on how the system functionality will evolve, but linear cost growth estimates are more likely to be acceptable. This principle, that costs must grow much more slowly than capacity, is depicted in Figure 1 below.
Figure 1. Costs grow slowly as capacity grows quickly
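The arithmetic behind this example can be sketched in a few lines. The 1 PB starting capacity, the $2 million first-year budget, and the yearly doubling come from the scenario above. The linear alternative, which adds a hypothetical $2 million per year, is only an assumption chosen to show the contrast.

```python
def doubled(initial, years):
    """Value after doubling once per year for `years` years."""
    return initial * 2 ** years


def linear(initial, yearly_increase, years):
    """Value after adding a fixed `yearly_increase` each year."""
    return initial + yearly_increase * years


print(f"{'year':>4} {'capacity (PB)':>14} {'doubling cost ($M)':>19} {'linear cost ($M)':>17}")
for year in range(7):
    capacity_pb = doubled(1, year)       # data volume doubles every year
    doubling_cost = doubled(2, year)     # cost doubling in step with capacity
    linear_cost = linear(2, 2, year)     # hypothetical: +$2M per year
    print(f"{year:>4} {capacity_pb:>14} {doubling_cost:>19} {linear_cost:>17}")
```

The gap between the two cost columns widens every year, which is why the architecture has to keep finding efficiencies as capacity grows.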
Adhering to this principle requires the system architecture to directly address and minimize the costs associated with rapidly growing capacity. For example, choosing a database that can be expanded with minimal manual intervention (for example, one that distributes data using consistent hashing) will make increasing data capacity a fixed, low cost activity. If a request load is cyclic, experiencing peaks and troughs, you can create an elastic solution that handles peaks by provisioning new cloud-based resources as required, and tearing these down to save costs once the peaks are over. If historical data will be accessed infrequently, it may be possible to store it on slower, low-cost storage media and build summarized views of the data for rapid access. Or, if the access patterns are amenable, the most frequently accessed historical data could be cached in online data stores.
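As an illustration of why consistent hashing keeps expansion cheap, the following is a minimal sketch of a hash ring with virtual nodes. It does not reproduce any particular database's implementation. The class name `ConsistentHashRing`, the node names, and the choice of 100 virtual nodes per server are all assumptions made for the example.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps keys to nodes so that adding a node relocates only a fraction of keys."""

    def __init__(self, nodes=(), vnodes=100):
        self._vnodes = vnodes  # virtual nodes per physical node (assumed value)
        self._ring = []        # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Each node owns many points on the ring, which evens out the load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def node_for(self, key):
        if not self._ring:
            raise LookupError("ring has no nodes")
        # Find the first ring point at or after the key's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (self._hash(key),))
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]


ring = ConsistentHashRing(["db-1", "db-2", "db-3"])
keys = [f"user:{i}" for i in range(1000)]
before = {key: ring.node_for(key) for key in keys}

ring.add_node("db-4")  # expand capacity
moved = [key for key in keys if ring.node_for(key) != before[key]]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to the new node")
```

With a naive `hash(key) % number_of_nodes` scheme, most keys would change owner when a node is added. Here only the keys that now fall to the new node are relocated, so growing the cluster does not require reshuffling the whole data set.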
Scalable software architectures therefore need to constantly seek out and implement efficiencies to ensure costs grow as slowly as feasible, while data and processing capacity grow as rapidly as requirements demand. These efficiencies pervade all layers of big data architectures, and by testing design options against this principle, scalable solutions are much more likely to emerge.
Second Principle: The More Complex a Solution, the Less Likely it Will Scale
Most of us are taught at an early age that if a deal sounds too good to be true, it probably is. Common sense tells us investments that are guaranteed to grow at 100 percent a year are almost certainly bogus or illegal, so we ignore them. Unfortunately, in building scalable software systems, we commonly see common sense put on the back burner when competing design alternatives and products are evaluated as candidates for major components of big data systems.
Let’s take a simple example: Strong consistency in databases is the bedrock of transactional systems and relational databases. Implementing strong consistency, however, is expensive, especially in distributed databases. To build highly scalable and available systems, the NoSQL database movement has consequently weakened the consistency models we can expect from databases. This trend has occurred for a good reason: weak consistency models are inherently more efficient to implement because the underlying mechanisms required are simpler.
In response, relational databases and the NewSQL technologies are now turning to new implementation models that provide strong consistency. NewSQL solutions aim to achieve the same scalable performance of NoSQL systems for online transaction processing workloads, while simultaneously supporting the atomicity, consistency, isolation, and durability (ACID) properties found in traditional SQL database systems. This approach sounds attractive, and some of the new open source technologies that exploit main memory and single-threading show immense promise. But fundamentally, achieving strong consistency requires more complexity, and as the scale of the problem grows, it is almost certainly not going to scale as well as weak consistency models.
Of course, weak consistency models will give your application greater scalability, but there are trade-offs. You probably have to de-normalize your data model and, hence, manage any duplication this introduces. Application code has to handle the inevitable conflicts that arise with weak consistency models. As always, there’s no free lunch. But, if your data and workload are amenable to a weak consistency model (and many are, even ones we think of as needing strong consistency), it will be your path to scalability.
Another example of the second principle is the scalability of many simple concurrent read requests versus concurrent queries that invoke complex, statistical machine learning and data mining approaches to analyze and process tens of MBs of data. We often see requirements that demand the latter, along with 2-second response times. Common sense tells us that is unlikely to happen and certainly won’t scale (this is when you consider materialized views and caches), but often in these circumstances, common sense simply doesn’t seem to prevail. This principle is depicted below in Figure 2.
There is one more key point to remember. Even though one design mechanism may be fundamentally more scalable than another, the implementation of the mechanism and how you use it in applications determine the precise scalability you can expect in your system. Poorly implemented scalable mechanisms will not scale well, and in our experience these are not uncommon. The same applies to inappropriate usage of a scalable mechanism in an application design, such as trying to use a batch solution like Hadoop for real-time querying.
Adhering to the second principle requires thinking about the fundamental distributed systems and database approaches that underpin design decisions. Even simple rules of thumb can be enormously beneficial when considering how a design may scale. Ignore these details at your peril, as many have recently found out.
Third Principle: Avoid Managing Conversational State Outside the Data Tier
State management is a much debated and oft misunderstood issue. Many frameworks, for example the Java Enterprise Edition (JEE), support managing state in the application logic tier by providing explicit abstractions and simple application programming interfaces (APIs) that load the required state from the database into service instance variables, typically for user session state management. Once in memory, all subsequent requests for that session can hit the same service instance, and efficiently access and manipulate the data that’s needed. From a programming perspective, stateful services are convenient and easy.
Unfortunately, from a scalability perspective, stateful solutions are a bad idea for many reasons. First, they consume server resources for the duration of a session, which may span many minutes. Session lengths are often unpredictable, so having many (long-lived) instances on some servers and few on others may create a load imbalance that the system must somehow manage. When sessions do not terminate cleanly (e.g., a user does not log out), an instance remains in memory and consumes resources unnecessarily before some inactive timeout occurs and the resources are reclaimed. Finally, if a server becomes inaccessible, you need some logic, somewhere, to handle the exception and recreate the state on another server.
As we build systems that must manage many millions of concurrent sessions, stateful services simply do not scale. Stateless services, where any service instance can serve any request in a timely fashion, are the scalable solution. In a stateless architecture, the session state is securely passed as parameters from the client with each request, and the data is accessed from the persistent data tier and/or from caches, where it is ephemeral and only causes a performance hit if not present. In server failure cases, requests can be simply routed to another service instance, and new service nodes can be started at any time to add capacity. Passing state with each request does consume more network resources, but the amount of resources is typically small as the state that must be communicated is conversational. A conversational state is only needed while a session (conversation) is in progress to control a sequence of interactions from a single client, and, hence, is limited in scope and size.
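One way to pass conversational state securely with each request is to have the service sign it, so that any instance can verify and use it without holding anything in memory. The sketch below assumes a shared secret (`SECRET_KEY`) known to every service instance and a hypothetical shopping-cart state. It protects integrity only, not confidentiality, and is not tied to any specific framework.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"demo-only-secret"  # hypothetical; shared by all service instances


def encode_state(state):
    """Serialize the conversational state and append an HMAC signature."""
    payload = base64.urlsafe_b64encode(
        json.dumps(state, sort_keys=True).encode("utf-8")
    ).decode("ascii")
    signature = hmac.new(SECRET_KEY, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{payload}.{signature}"


def decode_state(token):
    """Verify the signature and return the state, or raise ValueError."""
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(SECRET_KEY, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        raise ValueError("session state failed its integrity check")
    return json.loads(base64.urlsafe_b64decode(payload))


def handle_add_to_cart(token, item):
    """Stateless handler: everything it needs arrives with the request."""
    state = decode_state(token)
    state["cart"] = state.get("cart", []) + [item]
    return encode_state(state)  # the updated state goes back to the client


token = encode_state({"user": "u1"})
token = handle_add_to_cart(token, "book")  # could run on instance A
token = handle_add_to_cart(token, "pen")   # could run on instance B
print(decode_state(token))
```

Because no instance remembers the session, a failed server costs nothing beyond rerouting the next request, and new instances can start serving traffic immediately.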
Stateless solutions scale trivially by adding resources, simplifying application logic and system operations. The inevitable design trade-off is that stateless solutions place more load on your data tier, making the scalability of an application’s databases a crucial design factor. As Martin Kleppmann writes in his blog post, Six Things I Wish We Had Known About Scaling, "the hard part is scaling the stateful parts of your system: your databases." For this reason, I am working with other SEI researchers to develop LEAP(4BD), to help organizations fully understand the implications of their data tier choices for big data systems.
Fourth Principle: You Can’t Manage What You Don’t Monitor
Big data systems rapidly attain deployment scales that change many accepted wisdoms of software engineering. Put simply, two of these challenges to accepted wisdom are
The more software and hardware components big data systems have, the higher the likelihood that failures occur.
It is impractical to fully test new code because tests will become obsolete as soon as the size of the data on which the code operates grows. As Kleppmann explains, "realistic load testing is hard."
The only feasible solution to these two problems is to weave powerful monitoring and analysis capabilities into your applications and deployment infrastructure. By carefully monitoring how systems behave as code and databases and deployments scale, it becomes possible to more easily respond to failures and even proactively take actions as pressure points build up.
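As a small illustration of weaving monitoring into application code, the sketch below wraps a function so that every call records its latency and whether it failed. The `Metrics` class and the `lookup` function are hypothetical. A production system would export such measurements to a dedicated monitoring service rather than keep them in process memory.

```python
import time
from collections import defaultdict
from functools import wraps


class Metrics:
    """In-memory call counts, failure counts, and latencies per operation."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.failures = defaultdict(int)
        self.latencies = defaultdict(list)  # seconds per call

    def monitored(self, name):
        """Decorator that records every call to the wrapped function."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                self.calls[name] += 1
                try:
                    return func(*args, **kwargs)
                except Exception:
                    self.failures[name] += 1
                    raise  # monitoring must not hide the failure
                finally:
                    self.latencies[name].append(time.perf_counter() - start)
            return wrapper
        return decorator

    def failure_rate(self, name):
        calls = self.calls[name]
        return self.failures[name] / calls if calls else 0.0


metrics = Metrics()


@metrics.monitored("lookup")
def lookup(record_id):
    # Hypothetical data-tier access that fails for negative ids.
    if record_id < 0:
        raise KeyError(record_id)
    return {"id": record_id}


for record_id in (1, 2, -1, 3):
    try:
        lookup(record_id)
    except KeyError:
        pass

print(f"calls={metrics.calls['lookup']} failure_rate={metrics.failure_rate('lookup'):.2f}")
```

Tracking how these numbers change as data and deployments grow is what makes it possible to spot pressure points before they become failures.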
This last topic is complex, and worthy of treatment by itself, which I’ll do in my next installment in this blog series. In summary though, deep and flexible monitoring and analysis capabilities are fundamental to the success of big data systems, and your architectures must be designed to take into account the costs of these capabilities, which can be considerable. As we’ll see, this requirement for production-time data capture and analysis has many implications for successful big data system deployments.
Final Thoughts and Looking Ahead
The four principles described above hold for any big data system, so adhering to them will always be a good thing. In contrast, unconsciously violating these principles is likely to lead a system into downstream distress, slowing capability delivery, massively inflating costs, and potentially leading to project failure. Of course, simple expediency may mean you have to violate a principle to meet a near-term deliverable. Violations are not a bad thing, as long as you recognize the technical debt that has been incurred and plan accordingly to pay this debt back before the interest incurred becomes a major problem.
The next post in this ongoing series on big data will examine the challenges of scale and how observability is fundamental to the success of big data systems.
Additional Resources
To listen to the podcast, An Approach to Managing the Software Engineering Challenges of Big Data, please visit http://url.sei.cmu.edu/iq.
For more information about the Lightweight Evaluation and Architecture Prototyping (for Big Data), known as LEAP(4BD), please visit http://blog.sei.cmu.edu/post.cfm/challenges-big-data-294.
SEI Blog, Jul 27, 2015 02:07pm
By Grace Lewis, Principal Investigator, Edge-Enabled Tactical Systems, Software Solutions Division
Many warfighters and first responders operate at what we call "the tactical edge," where users are constrained by limited communication connectivity, storage availability, processing power, and battery life. In these environments, onboard sensors are used to capture data on behalf of mobile applications to perform tasks such as face recognition, speech recognition, natural language translation, and situational awareness. These applications then rely on network interfaces to send the data to nearby servers or the cloud if local processing resources are inadequate. While software developers have traditionally used native mobile technologies to develop these applications, the approach has some drawbacks, such as limited portability. In contrast, HTML5 has been touted for its portability across mobile device platforms, as well as the ability to access functionality without having to download and install applications. This blog post describes research aimed at evaluating the feasibility of using HTML5 to develop applications that can meet tactical edge requirements.
Foundations of Our Work
HTML5, developed by the World Wide Web Consortium (W3C), is the fifth release of the core technology markup language used for structuring and presenting content for the World Wide Web. HTML5 mobile applications are similar to web applications in that they use the same technologies for software application development: HTML, JavaScript, and CSS3. In addition to portability, HTML5 allows users to access functionality without having to download and install applications. In fact, only an HTML5-enabled web browser is needed to run HTML5. This flexibility also makes it easier to distribute applications to edge users, especially if they do not have frequent access to their base of operations or headquarters to get apps and updates.
Our Approach
To help military personnel in edge environments, mobile applications must
exploit available sensors so that contextual information can be captured with easy and minimal user interaction
process, store, and forward sensed and captured information
be resilient to intermittent communications connectivity and opportunistic in using communication capabilities as they become available
manage resources on the mobile devices so that they are used as efficiently as possible to maximize the availability of the system by reducing power consumption
At the Carnegie Mellon University Software Engineering Institute, the Edge-Enabled Tactical Systems Research Team focuses on helping warfighters and field personnel who use mobile devices in these environments. I conducted the research reported in this blog posting together with Bryan Yan, a recent graduate of CMU’s Master of Information Technology - Embedded Software Engineering (MSIT-ESE) program.
As outlined in our technical note on this research, Evaluation of the Applicability of HTML5 for Mobile Applications in Resource-Constrained Edge Environments, after defining the requirements of edge applications, we created a set of development features that mapped to these requirements. We then wrote simple HTML applications that implemented these features. Finally, we tested the code on the Google Chrome, Firefox for Android, and Dolphin web browsers on two different mobile devices with different versions of Android.
As an example, the code snippet shown below is an HTML5 application that supports geolocation. The code snippet registers the location tracking feature of the mobile device and reports the longitude, altitude, heading, speed and accuracy of the location in plain text:
Our results evaluating other HTML5 features as of December 2013 are as follows:
The Use of Bridging Frameworks
Given that HTML5 does not support all of the features we needed, we next explored the use of bridging frameworks, also called cross-platform mobile development frameworks, to see if they could be used to implement the edge features not supported by HTML5. Bridging frameworks are software packages that let developers create hybrid applications, combining native mobile application development with the advantages of HTML5.
As described in our report, we investigated the use of PhoneGap to create a notification application and a battery status application for the Android platform, which are features that are not supported by the HTML5 standard alone. The hybrid mobile applications generated by PhoneGap appear as native applications to the mobile platforms because PhoneGap creates a single Android Activity application (i.e., a wrapper application) that uses the Activity’s onCreate() method to load an HTML5 web page containing all the application logic and presentation. The code snippet below shows the wrapper application source code for the notification demo application for the Android platform:
While hybrid applications allow developers to access the mobile platform’s native application programming interfaces (APIs) via the PhoneGap framework, these applications appear native to the mobile platforms and thus must be installed onto the device. This compromises one of the main benefits of software applications developed using only HTML5, which is the ability to run an application with only a browser.
Software Architecture Implications
We also looked at HTML5 with respect to maintainability, performance, and portability. For maintainability we specifically looked at the effect of evolution in HTML5, browser support for HTML5, and bridging frameworks on existing HTML5 applications. We concluded that based on the information we found in change logs related to HTML5 there would be no effect on existing applications. To migrate hybrid mobile applications to HTML5 mobile applications as new features become supported, however, application logic must be modified, though the user interface (presentation layer) of the mobile application can be preserved. Having a coding standard that localizes JavaScript code in specific areas of the HTML5 file or places it in external files can facilitate the process.
For performance, we looked specifically at execution time and memory usage and discovered that HTML5 mobile applications are over 3 times slower and use 3 times more memory than native Java Android applications, which is expected because the applications run inside a browser. The larger memory consumption and slower performance are a potential drawback of using HTML5 for edge applications, given limitations in computing and battery power.
Finally, for portability we looked at portability of HTML5 applications across mobile web browsers, in particular Firefox, Chrome and Dolphin, and found that portability of HTML5 mobile applications across mobile web browsers varies. We expect this to change as browsers start supporting more parts of the HTML5 specification, but this has been the major drawback of HTML5 and one of the reasons why developers turn to bridging frameworks.
Wrapping Up and Looking Ahead
Warfighters and first responders at the tactical edge face limited computing resources, intermittent or no connectivity to the enterprise, and the potential of large amounts of field-collected data. These environments are highly fluid and unpredictable due to factors like mission changes, threats, or changing weather conditions. Our analysis found that as of June 2014, the HTML5 standard still lacks support for many critical features needed to support development and deployment of applications for these types of edge environments. Even though we expect that HTML5 will evolve rapidly to support the development of edge applications given the tremendous support it receives from the developer community (e.g., the Firefox operating system is built entirely using HTML5), the use of HTML5 for edge applications for its portability and maintainability advantages will have to be evaluated against constraints in battery, computing power, and screen size of mobile devices. Researchers on the Edge-Enabled Tactical Systems Research Team at the SEI will continue to examine HTML5 for use in mobile applications at the tactical edge.
We welcome your feedback on our research. Please leave feedback in the comments section below.
Additional Resources
To read the SEI technical note, Evaluation of the Applicability of HTML5 for Mobile Applications in Resource-Constrained Edge Environments, please visit http://resources.sei.cmu.edu/library/asset-view.cfm?assetid=295973.
SEI
.
Blog
.
Jul 27, 2015 02:04pm
|
|
On Wednesday (02/04/14) Adrian and I started the next part of our journey and joined the project at Carer’s Allowance in Preston.
We arrived in time to join in the daily morning stand up. This is a short meeting for each person to let the team know where they are up to on their work and any issues they’ve got. We were impressed with how quickly the Scrum Master got through all 15 people (11 minutes!). We introduced ourselves and were given a very warm welcome.
We were invited to attend a User Feedback presentation by Ben Holliday. This was a very interesting session to see how the project’s work is viewed by the end users (customers). All their feedback is considered and any amendments are built into future Sprints.
Following this we were invited to join in the planning session for the next Sprint. This gave us a really good insight into how Sprints are built up and why releases are done in small parts.
After lunch we set about making a plan for the objectives we have been set.
Our objectives to learn and use during our "settling in"
Check back for updates on how we are getting on……………..
DWP Digital
.
Blog
.
Jul 27, 2015 02:04pm
|
|
In the job market, we’ve all come to accept a little bit of exaggeration. We don’t like to think of ourselves as liars, but whether it’s to make ourselves or someone else feel better, save someone from trouble, or to make a story all the richer, human beings come with the baggage of having to lie for a number of reasons. Lying is bad, lying can be good; lying can lead people to disastrous situations, lying can save a life. It’s the definition of a necessary evil. Fibbing on your resume or interview feedback...
SHRM
.
Blog
.
Jul 27, 2015 02:03pm
|
|
By Sarah Sheard
Member of the Technical Staff
Software Solutions Division
The role of software within systems has fundamentally changed over the past 50 years. Software’s role has changed both on mission-critical DoD systems, such as fighter aircraft and surveillance equipment, and on commercial products, such as telephones and cars. Software has become not only the brain of most systems, but the backbone of their functionality. Acquisition processes must acknowledge this new reality and adapt. This blog posting, the second in a series about the relationship of software engineering (SwE) and systems engineering (SysE), shows how software technologies have come to dominate what formerly were hardware-based systems. This posting describes a case study: the story of software on satellites, whose lessons can be applied to many other kinds of software-reliant systems.
The Evolution of Satellites to Today and Beyond
My first professional job started in 1980 when I was hired as a satellite systems engineer at Hughes Aircraft Company in the space and communications division. At this time, there were a few career software developers, but individual engineers created most technical software for their own one-time use. The software that engineers developed ran on large and heavy computers that stayed earthbound, so it was not considered part of the satellite system. In contrast, software today provides a large fraction of the operational capability of satellite systems.
Satellite software initially appeared inside some of the hardware boxes, after the needed system functionality was allocated as requirements to the hardware elements. In contrast, in today’s software-reliant systems, the logic implemented by the software is the single dominant aspect of capabilities provided by the systems and must be considered early, perhaps before any other capabilities. This posting presents six figures that demonstrate the evolution from satellites that had little to no software in the 1970s, to satellites of today and beyond, which use software-driven services to link existing networks.
In all figures, the blue boxes that represent the satellite system are the same size for simplicity, even though cost, size, and weight have all grown significantly during this time. Red represents the software within the satellite. The amount of red is meant to suggest only roughly whether there is a little or a lot of software: the growth in actual lines of code has been exponential, far greater than the growth shown in the area of red.
Background
The satellites discussed in this post are geostationary communication satellites. These satellites are situated above the equator at the altitude whose orbit period is 24 hours, so the satellite appears always at the same point in the sky with respect to the rotating earth. Geostationary communication satellites have been known as "bent pipes" in the industry because the satellite receives communication signals from the ground and bends the same signals (amplified) back toward the ground.
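The altitude at which the orbit period matches the earth's rotation follows from Kepler's third law. The following is a minimal Python sketch, assuming a circular orbit; the constants are standard published values and the function name is illustrative, none of them come from this post.

```python
import math

# Assumed constants (standard published values, not taken from the post).
MU_EARTH = 3.986004418e14   # earth's gravitational parameter GM, in m^3/s^2
R_EARTH = 6_378_137.0       # earth's equatorial radius, in m


def circular_orbit_altitude(period_s: float) -> float:
    """Altitude (m) above the equator of a circular orbit with the given period.

    Kepler's third law: T = 2*pi*sqrt(a^3 / mu), so a = (mu * (T / (2*pi))^2)^(1/3).
    """
    semi_major_axis = (MU_EARTH * (period_s / (2 * math.pi)) ** 2) ** (1 / 3)
    return semi_major_axis - R_EARTH


solar_day = 24 * 3600        # the 24-hour figure used in the text
sidereal_day = 86_164.0905   # one rotation of the earth relative to the stars, in s

print(f"24 h period:     {circular_orbit_altitude(solar_day) / 1000:,.0f} km")
print(f"sidereal period: {circular_orbit_altitude(sidereal_day) / 1000:,.0f} km")
```

Either period puts the satellite roughly 36,000 km above the equator; the sidereal day is the more precise input because it measures the earth's rotation relative to the stars.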
Since satellites are space systems, they have different constraints from business systems or ground-based weapon systems. Launch weight is a huge determinant of cost (the rule of thumb in the 1980s was $20,000/pound). Radiation hardness is required to allow the satellite to pass through the Van Allen radiation belts unscathed and to survive solar winds and other orbital conditions. Temperatures are extreme outside the atmosphere, so satellite components must be qualified to large temperature ranges and/or protected with thermal control equipment, such as Kevlar thermal blankets.
Rocket technology made vast leaps in the 1960s race to the moon, allowing for more economical satellite launches that became commercially affordable. Early satellite technology challenges were hardware-focused, such as how to handle increasing numbers of channels with more output power and with better sensitivity to signals received from the ground, and how to maintain operations for more years (such as by increasing the fuel for orbit maneuvers). The weight constraint meant that technology improvements that shrank hardware, made it from lighter materials, or allowed more functions per box were highly desirable. It is clear now that software was the obvious answer, but it was not clear at the time, mostly because the computers on which the software ran were so heavy.
Software on satellites through the decades: 1960s and 1970s
In the 1960s satellites had little to no software since computers were too big and heavy to launch into orbit. As computer hardware shrank, it could be included on satellites, and satellite computer software developed. The first satellite computer software was embedded into the computer hardware. My colleague Patrick Place worked on the first reloadable satellite software, which was necessary because the launch window was prior to finalization of the software.
In the 1970s, computers were shrinking in size and became feasible for use on satellites. The challenge was to obtain space-qualified chips of high capacity and low weight. I had a t-shirt from 1980 that celebrated the space-qualification of a RAM chip with ¼ Kbyte capacity (256 bytes)! Imagine the software constraints posed by that limited capacity, which was state-of-the-art at the time!
By the end of the 1970s, satellites included some processors that performed small amounts of computation. For example, software was part of the subsystem that controlled satellite orientation (attitude control subsystem). The 1970s figure shows that one box on the satellite included a little software.
1980s
During the 1980s computers were getting smaller, lighter, and more powerful. Engineers of several subsystems found weight savings by using custom, space-qualified computer chips loaded with software. The 1980s figure shows that a few boxes on the satellite contained some software, including the attitude control processor, the telemetry and command boxes, and perhaps some payload boxes.
In the mid-1980s, I was involved in studying the tradeoff criteria when power subsystem engineers suggested they could save a great deal of weight by having the attitude control processor take over the logic for switching of solar panels between recharging batteries and supporting the power bus. Software would eliminate 10 power switches and their wire harnesses. The attitude control subsystem engineers were reluctant to take on logic that had nothing whatsoever to do with their subsystem, didn’t save them any weight, and complicated the software development and test process. (The outcome was that the switching was done in software.)
As a systems engineer in the 1980s, I was indoctrinated into the discipline only by learning on the job. At the time, few other resources existed. Few textbooks existed; the Defense Systems Management College (DSMC) guidebook, project management books (also fairly new), and Hatley-Pirbhai structured requirements analysis were some of the few subjects with available instructional material.
1990s
By the 1990s, many of the subsystems and units on the satellite included computer hardware and software. More important, the software in many hardware units was beginning to interact with software in other units on the spacecraft, and point-to-point communications were established among them. The 1990s figure shows the boxes starting to communicate with each other.
2000s
By 2000, the satellite industry had started to view software as an important aspect of the entire spacecraft. Software within the various subsystems and boxes was being coordinated, protocols were established, and data buses were set up that ensured compatible messaging and rapid communication. In addition, the interfaces to the outside world were being shown on software architecture diagrams. The 2000s figure shows a typical architecture of the software modules in the satellite system.
2010s and Interoperability
Between 2000 and the present, the concept of interoperability of systems has matured. Initially interoperability meant being able to connect to other technological devices. Today, it means the ability to communicate and exchange data, using a common semantic information exchange model so that the numbers do not change meaning at different points in the transmission. Some consider use of open standards to be critical for true interoperability (unencumbered by patents, proprietary claims, or dominance by one market competitor). The need for this interoperability has led to common architectural practices that are known well by software engineers, but not necessarily by systems engineers.
By the 2010s, diagrams of satellites commonly looked like software architecture diagrams, with layers, protocols, and strong interaction with software outside the satellite, both to ground stations and to users beyond the satellite terminals. The 2010s figure shows the strong layering and the relationship of the layers to external software modules. The modules interact to perform much more interesting functions than individual modules would; this means their architecture is more important and depends greatly on their implemented quality attribute requirements.
At this point the satellite system box just looks like a boundary drawn around software. Of course, other views of the satellite exist that depict the electrical power distribution, physical box placement, gain and link budgets, deployment sweeps, propulsion routing, and the like. But importantly, it’s becoming more widely accepted that the software or logical view of the satellite should be well-engineered and developed as early as the first hardware specification.
Near Future
The last diagram shows my projection into the future. Satellite hardware is now seen more as a commodity, selected so that the satellite can provide services and interconnectivity of the right kind among existing and new networks. The environment contains large, mature, evolving networks of many types that may be considered as data sources or users of new, possibly fused or mined data. The United States Armed Forces have a number of different wartime and business networks, some built on old technology that should be modernized, and which have various levels of classification. Commercial industry has its own networks. For example, health care is connecting medical devices in doctors’ offices, hospitals, pharmacies, and even individuals without necessarily having worked out security implications. Satellites can be seen as deliverers of information among such networks. DARPA’s "System F6" or "fractionated spacecraft" program investigates essentially this question: whether the needed functionality is best achieved by splitting the software among multiple hardware nodes (small satellites), and in what cases a single hardware spacecraft is best.
I expect that systems of the 2020s will start with a recognition of what networks exist in the environment relevant to a new system and to its needed capabilities. The systems that will be built will have capabilities that essentially "hang" on the relevant networks like modular kitchen accessories hang on racks at the hardware store. System capabilities will be determined by what functions and responses can be created first by connecting data and processes available on one network with data and processes on others; then the system’s capabilities and hardware will be specified and designed.
In the software engineering field, it is time to recognize this evolution in the relationships of systems to software. Satellites are a microcosm of broader trends in both military and commercial system development, whether weapons systems, software embedded in hardware, or business systems. In the past, capability needs led to a plan to acquire systems, and the systems required pieces of software within them. In the future, capabilities will lead to a plan to acquire needed software, to connect existing networks with new kinds of data and processing, and this will drive the need for specific systems. Satellites, of course, are only an example.
Wrapping Up and Looking Ahead
The capabilities of the future demand attention to software, and software architecture, first. Systems engineers who participate in early acquisition must bring software architecture knowledge into early specification of system activities and move toward the day when it is the software architecture of the real world, and secondarily that to be specified within a system, that determines what systems are brought into being. In addition, we need to involve systems engineers who bring particular systems into being to help them understand the repercussions of various systems engineering and acquisition decisions as far as software capability is concerned.
Naturally, not only software should be involved early. Systems engineers, with their broad views and customer-needs focus, must drive the early specification of capabilities, but they will not be able to do that without bringing the software engineers in as partners from day one.
In the next post in this series, I will discuss complexity as it relates to systems and software engineering.
Additional Resources
To read the blog post, The Value of Systems Engineering, by Joe Elm, please visit http://blog.sei.cmu.edu/post.cfm/the-value-of-systems-engineering.
A seminal systems engineering textbook was published in 1962 by Arthur Hall, A Methodology for Systems Engineering. The methods described in the book seem remarkably current, including eliciting requirements and looking for the business need behind the customer’s requests for a new system.
By 2000, many systems engineering texts had become available including those by Howard Eisner of George Washington University (Essentials of Project and Systems Engineering Management), Eberhardt Rechtin of the University of Southern California (Systems Architecting), and Richard Stevens et al. in the United Kingdom (Systems Engineering: Coping with Complexity). INCOSE had also released the first two versions of its systems engineering handbook by July 2000. None of these documents included much understanding of the important role software had come to play by that point, however.
In 1994 NCOSE participated in two efforts to define the body of practice of systems engineering: the Systems Engineering Capability Maturity Model (SE-CMM) and the Systems Engineering Capability Assessment Model (SECAM), NCOSE’s counterpart. By the late 1990s, NCOSE had become INCOSE, the International Council on Systems Engineering. The organization’s membership was growing and had played a role in merging the two systems engineering models and in writing five systems engineering standards.
SEI
.
Blog
.
Jul 27, 2015 02:03pm
|
|
As part of our continued development, we have been discussing the idea of holding a "hack day".
This is hacking in a good way, nothing to do with hacking into the Pentagon! We will have a group of approximately 10, working in pairs on a set problem. Through pair programming, a solution will be written in code and tested in a "show and tell" to the rest of the group.
We have now arranged to hold the first DWP Hack Day in Preston on 16th April.
Check back for updates on how we get on!
DWP Digital
.
Blog
.
Jul 27, 2015 02:02pm
|
|
By James Cebula
Senior Member of the Technical Staff
SEI CERT Division
This blog post was also co-authored by Lisa Young.
Organizations are continually fending off cyberattacks in one form or another. The 2014 Verizon Data Breach Investigations Report, which included contributions from SEI researchers, tagged 2013 as "the year of the retailer breach." According to the report, 2013 also witnessed "a transition from geopolitical attacks to large-scale attacks on payment card systems." To illustrate the trend, the report outlines a 12-month chronology of attacks, including a January "watering hole" attack on the Council on Foreign Relations website followed in February by targeted cyber-espionage attacks against The New York Times and The Wall Street Journal. The well-documented Target breach brought 2013 to a close with the theft of more than 40 million debit and credit card numbers. This blog post highlights a recent research effort to create a taxonomy that provides organizations a common language and set of terminology they can use to discuss, document, and mitigate operational cybersecurity risks.
Foundations of Our Work
Organizations of all sizes in the public and private sectors increasingly rely on information and technology assets that are supported by people and facilities. An attack that disrupts these assets can be devastating. In March, The Economist reported that the Target breach "cost the company US $61m in response costs in the fourth quarter alone and helped fuel a 5.5 percent drop in transactions during the crucial holiday shopping season."
For the purpose of drafting our taxonomy, we defined operational risks as
those arising due to the actions of people, systems and technology failures, failed internal processes, and external events
We defined operational cybersecurity risk as follows:
operational risks to information and technology assets that have consequences affecting the confidentiality, availability, or integrity of information or information systems
While there are a tremendous number of risks that organizations have no control over, organizations can control their understanding of risk in the context of business objectives. Specifically, organizations need to ask the following questions:
What are our most important mission aspects?
What threats and risks are we most concerned about?
It is not enough for organizations to understand what controls they are mandated to implement. For instance, federal agencies need to conduct a thoughtful consideration of risk specific to their business so that they can make financially responsible decisions when allocating resources. Our initial taxonomy, along with this latest effort, attempts to help organizations and federal agencies secure information systems and manage risk effectively.
As we outlined in the recently published SEI technical note, A Taxonomy of Operational Cybersecurity Risks, the taxonomy can be used as a tool to help identify all applicable cybersecurity risks within an organization. The impetus for revisiting and updating this taxonomy is the release in April 2013 of the National Institute of Standards and Technology (NIST) Special Publication 800-53 rev. 4, Security and Privacy Controls for Federal Information Systems and Organizations, which offers updated security controls for agencies. These controls are intended to protect federal agencies and should be applied tactically at the information-system level. To address expanding cyber threats, the NIST report addresses mobile and cloud computing; applications security; trustworthiness, assurance, and resiliency of information systems; insider threat; supply chain security; and the advanced persistent threat.
Our taxonomy is structured around a hierarchy that comprises the four main classes from the definition of operational risk:
actions of people (or inaction) taken either deliberately or accidentally that impact cybersecurity
systems and technology failures including failure of hardware, software, and information systems
failed internal processes including problems in the internal business processes that adversely affect the ability to implement, manage, and sustain cybersecurity, such as process design, execution, and control
external events including issues often outside the control of the organization, such as disasters, legal issues, business issues, and service provider dependencies
The taxonomy further decomposes each of the above listed classes into subclasses, and each subclass is described by its elements.
In attempting to address risk, organizations need to understand that risks can cascade: risks in one class can trigger risks in another. For example, a software failure due to an improper security setting could be the result of any of the elements of inadvertent or deliberate actions of people. Organizations therefore need to analyze a particular risk by involving several elements from different classes.
Consider the following example regarding external events pulled from our technical note:
Class 4 External Events
External events describes a class of operational risk associated with events generally outside the organization’s control. Often the timing or occurrence of such events cannot be planned or predicted. The supporting subclasses of this class include hazards, legal issues, business issues, and service dependencies.
Subclass 4.1 Hazards
The hazards subclass deals with risks owing to events, both natural and of human origin, over which the organization has no control and that can occur without notice. The elements supporting this subclass include weather event, fire, flood, earthquake, unrest, and pandemic.
4.1.1 weather event—adverse weather situations such as rain, snow, tornado, or hurricane
4.1.2 fire—fire within a facility or disruption caused by a fire external to a facility
4.1.3 flood—flooding within a facility or disruption caused by a flood external to a facility
4.1.4 earthquake—disruption of organizational operations due to an earthquake
4.1.5 unrest—disruption of operations due to civil disorder, riot, or terrorist acts
4.1.6 pandemic—widespread medical conditions that disrupt organizational operations
Subclass 4.2 Legal Issues
The legal issues subclass deals with risks potentially impacting the organization due to the elements regulatory compliance, legislation, and litigation.
4.2.1 regulatory compliance—new governmental regulation or failure to comply with existing regulation
4.2.2 legislation—new legislation that impacts the organization
4.2.3 litigation—legal action taken against the organization by any stakeholder, including employees and customers
Subclass 4.3 Business Issues
The business issues subclass, described by the elements of supplier failure, market conditions, and economic conditions, deals with operational risks arising from changes in the business environment of the organization.
4.3.1 supplier failure—the temporary or permanent inability of a supplier to deliver needed products or services to the organization
4.3.2 market conditions—the diminished ability of the organization to sell its products and services in the market
4.3.3 economic conditions—the inability of the organization to obtain needed funding for its operations
Subclass 4.4 Service Dependencies
The service dependencies subclass deals with risks arising from the organization’s dependence on external parties to continue operations. The subclass is associated with the elements of utilities, emergency services, fuel, and transportation.
4.4.1 utilities—failure of the organization’s electric power supply, water supply, or telecommunications services
4.4.2 emergency services—dependencies on public response services such as fire, police, and emergency medical services
4.4.3 fuel—failure of external fuel supplies, for example to power a backup generator
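The class, subclass, and element hierarchy excerpted above maps naturally onto a small nested data structure. The following Python sketch is one possible representation, not something taken from the technical note: the type names (`RiskClass`, `Subclass`) and the `describe` helper are hypothetical, and only the elements listed in the excerpt above are included.

```python
from dataclasses import dataclass, field


@dataclass
class Subclass:
    number: str
    name: str
    elements: dict[str, str] = field(default_factory=dict)  # element id -> element name


@dataclass
class RiskClass:
    number: str
    name: str
    subclasses: list[Subclass] = field(default_factory=list)


# Class 4 exactly as listed in the excerpt (names only; descriptions omitted).
EXTERNAL_EVENTS = RiskClass("4", "External Events", [
    Subclass("4.1", "Hazards", {
        "4.1.1": "weather event", "4.1.2": "fire", "4.1.3": "flood",
        "4.1.4": "earthquake", "4.1.5": "unrest", "4.1.6": "pandemic",
    }),
    Subclass("4.2", "Legal Issues", {
        "4.2.1": "regulatory compliance", "4.2.2": "legislation",
        "4.2.3": "litigation",
    }),
    Subclass("4.3", "Business Issues", {
        "4.3.1": "supplier failure", "4.3.2": "market conditions",
        "4.3.3": "economic conditions",
    }),
    Subclass("4.4", "Service Dependencies", {
        "4.4.1": "utilities", "4.4.2": "emergency services", "4.4.3": "fuel",
    }),
])


def describe(risk_class: RiskClass, element_id: str) -> str:
    """Return the 'class > subclass > element' path for an id such as '4.2.3'."""
    for sub in risk_class.subclasses:
        if element_id in sub.elements:
            return f"{risk_class.name} > {sub.name} > {sub.elements[element_id]}"
    raise KeyError(element_id)


print(describe(EXTERNAL_EVENTS, "4.2.3"))  # External Events > Legal Issues > litigation
```

Keeping the numbered identifiers as keys lets a risk register refer to an element by its id while still recovering the class and subclass it belongs to.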
Mapping to Existing Federal Initiatives
As with our initial taxonomy, which we also detailed in a technical note, this latest version spans a broad swath of industries and explores how to reconcile risk mitigation efforts with recent federal government initiatives:
the Federal Information Security Management Act of 2002 (FISMA 2002), which applies to U.S. federal agencies and provides a standardized definition of information security that links identified operational cybersecurity risks to specific examples of consequences that impact confidentiality, integrity, and availability
security guidance provided in NIST special publications, which provide a control catalog that can be applied to federal information systems based on an analysis of the system’s relative importance and consequence of loss
the threat profile contained within the CERT Operationally Critical Threat, Asset, and Vulnerability Evaluation (OCTAVE) method. The OCTAVE method uses the concept of asset-based threat profiles. OCTAVE uses four standard threat categories:
-human actors using network access
-human actors using physical access
-system problems
-other problems including failed internal processes and external events
For example, NIST SP 800-53 rev 4 includes, for the first time, a number of controls associated with mitigating insider threats. These and other new controls are now mapped into the taxonomy.
In Conclusion and Looking Ahead
Although estimates vary, a recent survey by Ponemon Institute estimated the cost of cybercrime in 2012 to be $8.9 million per company. As shown by the Verizon report mentioned in the introduction, the nature of cybersecurity will continue to evolve on new fronts, prompting new revisions to our taxonomy. In the meantime, we are field testing our taxonomy with various organizations subject to regulatory compliance and risk tolerance. The results of these field tests will also inform future revisions to this taxonomy.
We welcome your feedback on our research in the comments section below.
Additional Resources
To read the SEI technical report, A Taxonomy of Operational Cyber Security Risks, Version 2, please visit http://resources.sei.cmu.edu/library/asset-view.cfm?assetid=91013.
SEI
.
Blog
.
Jul 27, 2015 02:01pm
|
|
On 16th April we held the first DWP Hack Day in Preston. The day was attended by 10 enthusiastic people within DWP IT, including the DWP Developer team.
Attendees were:
Paul Anderson
Paul Bowman
Wayne Hopkins
Nathalie Harrold
Donna Kirk
Mike Morris
Suzanne Russell
Daniel Sherman
Adrian Stone
Ian Tidmarsh
The day was delivered by a Raspberry Pi trainer, Les Pounder, who specialises in delivering Hack Days to inexperienced people.
The day started in an unusual style. We had to programme a robot, without using a computer. Or a robot……..
We split into 3 teams, each nominating a team member to adopt the role of "robot" and set about inventing our own language to direct our human around a maze. This was a very useful exercise in teaching the structure of programming and also understanding that you can achieve a lot by using clear and accurate commands.
Following a short break, we fired up the Raspberry Pis and began our computer learning of the day. We were using a programme called "Scratch".
Scratch in action
This is an excellent tool to teach the structure and concepts of code without the horror of seeing pages of symbols and letters. It allows you to place blocks and snap together groups of commands. We worked in pairs, using a technique called "Pair Programming" where one person "Navigates" and the other person "Drives". That is, one person types the code (drives) while the other watches and directs where needed (navigates).
"Hackers" pair programming
We broke for lunch and returned for the main event: building and coding a real robot. It was more Wall-E than Robocop, but exciting nonetheless!
After working out which pins power which motor (and getting them to spin the wheels the same way!) we set about building a series of blocks in Scratch to make the robot move around a maze with keyboard input. This worked very well and allowed us to see how our code was working with Left, Right and Forward keyboard inputs.
Following this, we transferred our blocks of Scratch into real blocks of Python code. This allowed us to automate the robot to move around the maze with no user input. We achieved this by fine-tuning the length of time each motor spins.
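The Python we wrote on the day is not reproduced here, but the idea of timed moves can be sketched as below. Everything in this sketch is an assumption for illustration: `FakeRobot` is a hypothetical stand-in that only records commands (on the real robot a motor control library would take its place), and the durations in `ROUTE` are made-up values of the kind you would tune by trial and error.

```python
import time


class FakeRobot:
    """Hypothetical stand-in for real motor control; it only records commands."""

    def __init__(self):
        self.log = []

    def forward(self):
        self.log.append("forward")

    def left(self):
        self.log.append("left")

    def right(self):
        self.log.append("right")

    def stop(self):
        self.log.append("stop")


# Made-up route through a maze: (action, seconds the motors spin).
ROUTE = [
    ("forward", 1.0),
    ("left", 0.4),
    ("forward", 0.8),
    ("right", 0.4),
    ("forward", 1.2),
]


def run_route(robot, route, sleep=time.sleep):
    """Start each move, let the motors spin for the tuned time, then stop."""
    for action, seconds in route:
        getattr(robot, action)()  # look up and call forward(), left() or right()
        sleep(seconds)            # the duration is what gets fine-tuned
        robot.stop()


robot = FakeRobot()
run_route(robot, ROUTE, sleep=lambda seconds: None)  # dry run without real delays
print(robot.log)
```

Keeping the route as data means that tuning the maze run only involves changing the numbers, not the code that drives the motors.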
Building a robot
Build complete
We also decided to Blu-Tack a whiteboard marker to the robot and make it into an elaborate Spirograph! This was an excellent way to visually demonstrate how loops work.
Raspberry Pi Drawing
As the day drew to a close, we learnt that by splitting code up into pieces, it is not as daunting as it first appears, and that by applying logic you can produce very impressive results.
DWP Digital
.
Blog
.
Jul 27, 2015 02:01pm
|
|
By Ian Gorton
Senior Member of the Technical Staff
Software Solutions Division
The term big data is a subject of much hype in both government and business today. Big data is variously the cause of all existing system problems and, simultaneously, the savior that will lead us to the innovative solutions and business insights of tomorrow. All this hype fuels predictions such as the one from IDC that the market for big data will reach $16.1 billion in 2014, growing six times faster than the overall information technology market, despite the fact that the "benefits of big data are not always clear today," according to IDC. From a software-engineering perspective, however, the challenges of big data are very clear, since they are driven by ever-increasing system scale and complexity. This blog post, a continuation of my last post on the four principles of building big data systems, describes how we must address one of these challenges, namely, you can’t manage what you don’t monitor.
The Challenges of Scale
While debate rages over the precise characteristics of a big data system, from a software-engineering perspective, the major characteristics are well understood. These characteristics revolve around the scale of applications and the inherent complexity that emerges as applications grow to manage and analyze ever-growing amounts of data. As I’ve written in earlier blog posts, the requirements of scalability mandate that we adopt new design and engineering approaches because many existing tenets of software engineering simply don’t hold at scale.
A fundamental characteristic of big data systems is that there will be an increasing number of software and hardware failures as the scale of an application increases. Component failure, in fact, must be seen as the norm, and applications must be designed to be resilient to failures and diligently handle them to ensure continued operations. Note that software failures induced by scale may occur in application components or in third-party components integrated into the system, both open-source and commercial.
A second characteristic is that, as scale increases, so does complexity. There are more component interactions, increasingly unpredictable request loads on data collections, and increased competition for shared resources, including central processing units (CPUs) on multicore nodes, memory, cluster interconnects, and disks. This inherent complexity and non-deterministic behavior make diagnosing aberrant behavior an immense challenge. If performance suddenly becomes a problem for particular transactions, it can be immensely time-consuming and challenging to diagnose whether the cause lies in the transaction implementation itself or is a result of unexpected interactions with other components.
A third characteristic of big data systems is that scale makes thorough testing of big data applications before deployment both impractical and infeasible. For example, if your application will store and analyze one petabyte of data, it is probably impractical to create a test environment and test data at this scale. Moreover, even if you could test at deployment scale, as soon as your data grows, your code is operating beyond its tested tolerances. In reality, the only way to discover if your new components operate correctly is to deploy them on the production system and use techniques such as canary testing to validate their behavior.
Finally, it’s important to see these challenges of failure handling, complexity, and testing in the context of contemporary big data system deployments. Companies like Netflix, which uses Cassandra to manage data on 750 nodes as part of its cloud-based software infrastructure, are well-documented examples that herald the future for many large government and business organizations.
Even today, there are many conventional government and business systems that require massive resources to satisfy their requirements. Here are some examples from the healthcare domain. The United Kingdom’s National Health Service (NHS) Spine2 system is being initially deployed on hundreds of processors running on dozens of Linux-based commodity servers located across multiple data centers. Spine2’s database contains clinical records for each of the 70 million NHS patients, and this data is requested and amended more than 2.6 billion times a year. Spine2 also connects and services more than 21,000 organizations and links over 27,000 software systems within these organizations. Ease of scalability to support information growth and high availability to support clinical care are key drivers behind the design of Spine2. Similar systems have been built to support the Danish medical records system, and, in the United States, the Military Health System stores several petabytes of patient and clinical data and integrates with a large number of external health providers.
Observability: Fundamental to Successful Big Data Systems
The term observability defines a collection of capabilities that make it possible to monitor, analyze, and proactively and reactively respond to events that occur at runtime in a big data system. As big data systems scale, it must be possible to observe and reason about changes in behavior so the system can be operated and evolved reliably. The adage "you can’t manage what you don’t monitor" is especially true for complex, distributed systems that have an overwhelming number of moving parts, both hardware and software, that interact with each other in many subtle and unanticipated ways.
Here’s a simplified example of the problems that can arise in big data systems. Two separately developed, completely independent business transactions were providing the expected response times in operations across a horizontally partitioned database. Suddenly, one transaction slowed down, intermittently, making it occasionally non-responsive to user needs. Extensive investigations over several days, including detailed logging in production and attempts to recreate the situation in test, eventually led to identifying the root cause. Essentially, periodic and brief request spikes for one of the transactions were overloading a small number of database nodes. During these overload conditions, when the other transaction was invoked, it attempted to read data using a secondary index that is distributed across all nodes. These secondary index reads from the overloaded nodes were taking tens of seconds to respond, leading to unacceptable latencies for those transactions.
How could an observability framework have helped discover the root cause of this problem more quickly? If the developers could have analyzed performance data to visualize transaction volumes for the first transaction against latencies for the second, it would have been immediately obvious that there was a correlation. This visualization would have highlighted the areas of the code to investigate, because the root cause of the high transaction latencies stemmed from this subtle interaction.
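As a rough illustration of the kind of analysis described above, the following Python sketch computes the Pearson correlation between per-minute request volume for one transaction and per-minute latency for the other. The sample numbers and the function name are invented for illustration; they are not data from the incident described.

```python
import math

def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-minute samples: requests/min for transaction A and
# mean latency (ms) for transaction B over the same minutes.
volume_a = [120, 130, 125, 900, 950, 128, 122, 880]
latency_b = [40, 42, 41, 15000, 18000, 43, 39, 14000]

r = pearson(volume_a, latency_b)
print(f"correlation between A volume and B latency: {r:.2f}")
```

A coefficient close to 1 would point investigators at the interaction between the two transactions rather than at either one in isolation. In practice, a plot of the two series on a shared time axis would probably be the first thing to look at.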
Of course, this type of performance data does not come for free from databases or web and application servers. Capturing the necessary performance data to perform this type of analysis requires
applications to be instrumented with application-relevant measures for observability
a data collection capability to capture and store observability data
analysis capabilities for developers and operators to gain rapid insights from observability data
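The three capabilities listed above can be sketched in a few lines of Python. This is a minimal, single-process illustration under stated assumptions: the class `MetricsRegistry`, the function `summarize`, and the metric name `order_lookup` are all hypothetical, and a real system would ship the drained samples to a separate, distributed collector.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class MetricsRegistry:
    """Hypothetical in-process registry holding raw latency samples per metric."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, name, value_ms):
        self._samples[name].append(value_ms)

    @contextmanager
    def timed(self, name):
        """Instrument a block of application code with a latency measure."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(name, (time.perf_counter() - start) * 1000.0)

    def drain(self):
        """Hand the collected samples to a collector and reset local state."""
        snapshot, self._samples = dict(self._samples), defaultdict(list)
        return snapshot

def summarize(snapshot):
    """Analysis step: per-metric sample count and mean latency in milliseconds."""
    return {
        name: {"count": len(values), "mean_ms": sum(values) / len(values)}
        for name, values in snapshot.items()
    }

registry = MetricsRegistry()
with registry.timed("order_lookup"):
    sum(range(10_000))  # stand-in for real application work
registry.record("order_lookup", 12.5)

summary = summarize(registry.drain())
print(summary)
```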
By necessity, internet companies operating at immense scales have built their own observability solutions. These solutions are extensive and powerful and have been built at considerable cost specifically for each operational environment. In recent discussions, we discovered teams of 10 to 20 people building and evolving observability solutions in large internet organizations. The resources used by these solutions, both at runtime and in terms of storage, ranged from 3 percent to 18 percent of the total system resources and in one case monitored approximately 500 million separate metrics per minute. When systems execute on thousands of nodes with petabytes of data, a lot of resources must be dedicated to observability, which is a testament to the necessity of this capability. An example description of Twitter’s solution in this area can be found here. Likewise, a description from Netflix on its efforts to design and build its next generation of innovative insight tools for operational visibility is available here.
Observability Stack
The figure below represents an abstraction of the common components of an observability solution, organized as a layered architecture. We describe key capabilities required and inherent design trade-offs for each layer in the following:
Data collection: There are two elements to data collection: specifying what to collect and periodically gathering raw data from every node in the system. A component should be dynamically configurable to specify the application-relevant data to capture; this can be done using a framework such as Archaius. Measured data should then be transported to a distributed data collection capability that either periodically pulls data from every node or accepts data that is pushed from multiple nodes. Key design decisions in this layer revolve around the number of measures collected, at what frequency, and whether nodes themselves perform some local preprocessing of data, with a potential loss of fidelity, or whether raw data is transported for further processing. These alternatives represent design trade-offs that can greatly affect the amount of resources used by the observability framework and its ability to scale.
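The trade-off between local preprocessing and shipping raw data can be made concrete with a small Python sketch. The function name and sample values are assumptions for illustration only. The node reduces its raw samples to a fixed-size summary, which is cheap to transport but cannot be used later to recover a median or percentile.

```python
def summarize_locally(samples_ms):
    """Reduce raw samples to a fixed-size summary before sending them off-node."""
    if not samples_ms:
        return {"count": 0}
    return {
        "count": len(samples_ms),
        "sum_ms": sum(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
    }

# Hypothetical latency samples gathered on one node during one interval.
raw = [12.0, 15.0, 11.0, 480.0, 13.0]
summary = summarize_locally(raw)

print(len(raw), "raw values vs", len(summary), "summary fields")
print("mean_ms:", summary["sum_ms"] / summary["count"])
```

The summary stays the same size no matter how many samples the node records, which is where the resource savings come from. The cost is that the single 480 ms outlier is only visible as the maximum, not as part of a distribution.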
Data aggregation: Aggregation is essential to constructing application-relevant metrics from raw data points. As a simple example, in a horizontally distributed database, the number of reads and writes served per time interval is the sum of the requests served by every database partition. Another example arises at the business transaction level, where the overall latency is determined by the response times of all the lower-level services called. Should the business transaction itself measure these? Or should the services themselves report their latencies so that they can be aggregated across all requests from different services? In terms of design trade-offs, the amount of aggregation performed will determine the type of queries that can be efficiently asked of the data and the amount of storage it consumes.
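The partitioned-database example above amounts to a sum over partitions for each time interval. A minimal Python sketch, with hypothetical partition names, interval labels, and counts:

```python
from collections import Counter

def aggregate_reads(per_partition_counts):
    """Sum per-interval read counts over all database partitions."""
    totals = Counter()
    for counts in per_partition_counts.values():
        totals.update(counts)  # adds counts key by key
    return dict(totals)

# Hypothetical reads served per one-minute interval by each partition.
reads = {
    "partition-0": {"10:00": 100, "10:01": 120},
    "partition-1": {"10:00": 80, "10:01": 95},
    "partition-2": {"10:00": 60},
}

totals = aggregate_reads(reads)
print(totals)
```

Once only the totals are stored, a question such as "which partition was hottest at 10:01?" can no longer be answered, which is the query-versus-storage trade-off described above.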
Data storage: Persisting observability data is necessary for downstream analysis purposes. Major design decisions revolve around whether to store raw as well as aggregated data, and how long the data should be kept. Observability data can quickly become a big data problem in itself, and hence the resources it consumes can become costly. This realization leads to design approaches that only keep fully aggregated and/or raw data for a specific time window (e.g., one month, six months), and dynamically create lower fidelity snapshots of the data for specific time periods (e.g., last day, last week) that are updated daily.
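One way to picture a lower-fidelity snapshot is as a roll-up of fine-grained points into coarser time buckets. The following Python sketch averages points into fixed-width buckets. The function name, the choice of averaging, and the sample data are assumptions, not a description of any particular product.

```python
from collections import defaultdict

def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, sum(vs) / len(vs)) for start, vs in sorted(buckets.items())]

# Hypothetical per-minute measurements, timestamps in seconds.
points = [(0, 10.0), (60, 20.0), (120, 30.0), (3600, 40.0), (3660, 60.0)]

hourly = downsample(points, 3600)
print(hourly)
```

Five points become two, so storage shrinks, but short spikes inside an hour are averaged away.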
Data analysis: Both developers and operators need tools to query observability data so they can anticipate and diagnose problems. Querying observability data requires user interfaces for constructing queries and visualizing performance measures in a user-driven fashion. Users should also be able to define triggers that fire alarms when certain metrics stray out of expected ranges (e.g., the 95th-percentile response time for a core service exceeds 500 milliseconds). Important design decisions revolve around the usability of the user interface in terms of the ease of constructing queries and visualizing results.
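The trigger mentioned in the example above can be sketched as follows. This Python sketch uses the nearest-rank definition of a percentile, which is one of several common definitions and is chosen here only for simplicity. The function names and latency samples are hypothetical.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def p95_alarm(latencies_ms, threshold_ms=500):
    """Return True when the 95th-percentile latency exceeds the threshold."""
    return percentile(latencies_ms, 95) > threshold_ms

# Hypothetical response times (ms) for a core service over one window.
healthy = [100] * 95 + [900] * 5    # only 5% of requests are slow
degraded = [100] * 90 + [900] * 10  # 10% of requests are slow

print("healthy window fires alarm:", p95_alarm(healthy))
print("degraded window fires alarm:", p95_alarm(degraded))
```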
Existing Solutions
Distributed system monitoring tools have existed for many years. Commercial tools, such as AppDynamics, are comprehensive products. However, like their commercial counterparts in the database arena, license costs can become an issue as systems scale across hundreds and thousands of servers. Open source equivalents such as Nagios and Ganglia are also widely used, and a useful comparison of technologies in this space can be found here. Adopting these technologies and tailoring them to highly heterogeneous execution environments to observe application-relevant measures, as well as making them operate at the scale required by the next generation of big data applications, will represent a major challenge for any development organization.
At the SEI, we are researching the creation of highly configurable, scalable, and extensible observability solutions. In this context, being configurable requires tools that allow a development team to rapidly configure the data collection requirements and trade off analytic fidelity against resource usage, both for data collection and storage. Being scalable requires a lightweight core framework that is engineered from the ground up to ensure extreme scalability. Finally, being extensible demands that rapid modifications be possible to interact with new application endpoints (e.g., databases and middleware) that are introduced during system evolution. Our aim is an observability framework that can be adopted in a cost-effective manner and rapidly customized by a wide range of organizations to help attain high levels of assurance and reliability.
If you're interested in this work, we’d like to hear from you. Please leave a comment below.
Additional Resources
To view Ian Gorton’s webinar, Software Architecture for Big Data Systems, please visit http://resources.sei.cmu.edu/library/asset-view.cfm?assetid=298346.
To listen to the podcast, An Approach to Managing the Software Engineering Challenges of Big Data with Ian Gorton and John Klein, please visit http://www.sei.cmu.edu/podcasts/podcast_episode.cfm?episodeid=294249&wtPodcast=AnApproachtoManagingtheSoftwareEngineeringChallengesofBigData.
SEI, Blog, Jul 27, 2015 02:00pm
|



