Subscribe to Methods & Tools
if you are not afraid to read more than one page to be a smarter software developer, software tester or project manager!
As more and more companies look to the cloud for their data storage needs, the same questions are asked over and over again: if a security breach were to occur, who would be responsible, and how strong a guarantee could the provider give us that our data would be safe from prying eyes?
The new Admin4 release is the first to contain a greatly enhanced PostgreSQL module. It includes:
Favourites, Snippets and Presets are stored centrally in the database, making them available for roaming usage. Your day-to-day code snippets, standard filters and frequently used objects are right there no matter which workstation you're using to maintain the database.
Download it from SourceForge; it's free (Apache2 licensed). Contributors and testers for this fresh project are welcome!
The 2014 Postgres Open Program Committee is pleased to announce the opening of Early Bird Tickets and Tutorial Registration sales.
Early Bird Tickets: Book early through July 6, 2014 to save $200 on your registration! Postgres Open has received such strong support from our community sponsors that we're retaining our ticket prices from 2013 - no increase this year.
Conference Tutorials: We've brought in some new presenters and topics this year for our tutorial sessions held Wednesday, September 17th.
In addition, Heroku is offering a free platform tutorial covering dynos and their environment, buildpacks, deployment, releases, HTTP routing, add-ons, and of course Postgres.
Early Bird Tickets and All Tutorials can be purchased here: https://postgresopen.org/2014/tickets/
The CfP has been extended through Sunday, June 8th: we've received some impressive submissions thus far; however, a few of you have asked for some extra time, so we're giving you until this Sunday, June 8. https://postgresopen.org/2014/callforpapers/
We look forward to updating you in the coming weeks with more speaker info and the presentation schedule for Postgres Open 2014!
Yesterday, at the Hadoop Summit, Microsoft announced that Azure HDInsight now supports Hadoop 2.4. This is the next release of our 100 percent Apache Hadoop-based distribution for Microsoft Azure. In conjunction with Hortonworks' recent release of HDP 2.1 as a Hadoop on Windows offering and the Analytics Platform System as a Hadoop and data warehousing appliance, this release of Azure continues our strategy of making Hadoop accessible to everybody with Hadoop in the cloud.
This release of HDInsight is important because it includes the latest benefits of Apache Hadoop 2.4, which provides order-of-magnitude (up to 100x) improvements in query response times and continues to leverage the benefits of YARN (upgrading to the future "Data Operating System for Hadoop 2.0"). Finally, we are also providing an easy-to-use web interface that gives users of HDInsight a friendly experience: you can compose and issue Hive queries through a graphical user interface.
The 100x improvements in query response times are due to the Stinger initiative, in which Microsoft, in collaboration with Hortonworks and the open source software (OSS) community, has brought some of the technological breakthroughs of SQL Server to Hadoop. We are excited to see Microsoft-led contributions bring generational improvements to Hadoop.
Since Azure HDInsight's release in October 2013, we have seen tremendous momentum of customers deploying Hadoop in the cloud. Beth Israel Deaconess Medical Center, a teaching hospital of Harvard Medical School, is using HDInsight to process large amounts of unstructured log data and to meet its stringent data retention requirements (which can be as long as 30 years). Virginia Polytechnic Institute is using the power of HDInsight to analyze massive amounts of DNA sequencing data. More of these examples can be read in CIO magazine, which recently highlighted several HDInsight customer stories.
With Hortonworks HDP 2.1 for Windows, Microsoft Analytics Platform System, and Microsoft Azure, Microsoft customers have an unprecedented number of options to deploy Hadoop on-premises, in the cloud, or in hybrid configurations. We invite you to learn more through the following resources:
Hadoop Summit kicked off today in San Jose, and T. K. Rengarajan, Microsoft Corporate Vice President of Data Platform, delivered a keynote presentation where he shared Microsoft's approach to big data and the work we are doing to make Hadoop accessible in the cloud. At the event, we also announced that Azure HDInsight, our Hadoop-based service in the cloud, now supports Hadoop 2.4.
Investing in Hadoop
Hadoop is a cornerstone to our approach of making data work for everyone. As part of this bet we have fully embraced the Hadoop ecosystem and have prioritized contributing back to the community and Apache Hadoop-related projects e.g. Tez, Stinger and Hive. All told, we’ve contributed 30,000 lines of code and put in 10,000+ engineering hours to support these projects, including the porting of Hadoop to Windows. We’ve done this in partnership with Hortonworks, a relationship that ensures our Hadoop solutions are based on compatible implementations of Hadoop. One of the results of that partnership is the engineering work that has led to the Hortonworks Data Platform for Windows and Azure HDInsight.
The massive scale, power, elasticity, and low cost of storage make the cloud the best place to deploy Hadoop. That's one of the reasons we have invested heavily in our cloud-based Hadoop solution, Azure HDInsight, which combines the best of open source with the flexibility of cloud deployment. It's also integrated with our business intelligence tools, enabling easy access and transformation of data from HDInsight to Excel and Power BI for Office 365.
Today we are providing an update to Azure HDInsight with support for Hadoop 2.4, the latest version of Hadoop. This update includes interactive querying with Hive using advancements based on SQL Server technology, which we are also contributing back to the Hadoop ecosystem through project Stinger. With this update to HDInsight, customers can use the speed and scale of the cloud to gain a 100x response time improvement.
HDInsight is just one part of our comprehensive data platform, which includes the building blocks customers need to process data anywhere it lives and in the format where it is born, whether they use Microsoft Intelligent Systems Service to capture machine-generated data within the Internet of Things, SQL Server or Azure SQL Database to store and retrieve data, Azure HDInsight to deploy and provision Hadoop clusters in the cloud, or Excel and Power BI for Office 365 to analyze and visualize data.
This blog post will highlight PolyBase’s truly unique approach focusing on:
In the very recent past, various SQL-over-Hadoop/HDFS solutions have been developed, such as Impala, HAWQ, Stinger, SQL-H, and Hadapt, to name just a few. While there are clear technical differences between the various solutions, at a high level they are similar in offering a SQL-like front end over data stored in HDFS.
So, is PolyBase yet another similar solution competing with these approaches? The answer is yes and no. At first glance, PolyBase is a T-SQL front end that allows customers to query data stored in HDFS. However, with the recently announced Analytics Platform System (APS), we have updated PolyBase with new syntax to highlight our extensible approach. With PolyBase, we bring various Microsoft data management services together and allow appliance users to leverage a variety of Azure services. This enables a new class of hybrid scenarios and reflects the evolution of PolyBase into a true multi-data-source query engine. It allows users to query their big data regardless of whether it is stored in an on-premises Hadoop/HDFS cluster, Azure storage, Parallel Data Warehouse, or other relational DBMS systems (offered in a future PolyBase release).
Complete Data Platform with PolyBase as key integrative component

2. Freedom of Choice
One key differentiator of PolyBase compared to all of the existing competitive approaches is 'openness'. We do not force users to commit to a single solution, as some Hadoop providers do. With PolyBase, you have the freedom to use an HDInsight region as part of your APS appliance, to query an external Hadoop cluster connected to APS, or to leverage Azure services from your APS appliance (such as HDInsight on Azure).
To achieve this openness, PolyBase offers these three building blocks.
Building blocks for PolyBase
The syntax for using PolyBase is simple and follows familiar T-SQL language constructs.
T-SQL for creating external data sources (Azure, external Hadoop cluster, HDI region)
T-SQL for creating external file formats (delimited text files and Hive RCFiles)
T-SQL for creating external tables (for Azure, external Hadoop cluster, HDI region)
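The three captioned statements above likely accompanied screenshots in the original post. As a rough sketch of what they look like (object names such as Azure_DS, DelimText2, and SensorData_ExternalHDP are borrowed from the examples later in this article; column names and exact option names are illustrative and may vary by APS/PDW version):

-- 1. External data source pointing at Azure storage (or a Hadoop cluster)
CREATE EXTERNAL DATA SOURCE Azure_DS WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://sensordata@myaccount.blob.core.windows.net'
);

-- 2. External file format for delimited text files
CREATE EXTERNAL FILE FORMAT DelimText2 WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

-- 3. External table tying a file location to the data source and file format
CREATE EXTERNAL TABLE SensorData_ExternalHDP (
    MachineKey   INT,
    Temperature  FLOAT,
    YearMeasured INT
) WITH (
    LOCATION = '/Sensor_Data/',
    DATA_SOURCE = Azure_DS,
    FILE_FORMAT = DelimText2
);

Once these three objects exist, the external table can be queried with ordinary T-SQL alongside regular PDW tables.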
A user can now create statistics for each of the external tables shown above to improve the query performance. We extended SQL Server’s mature stats framework to work against external tables in the same way it works against regular tables. Statistics are crucial for the PolyBase query engine in order to generate optimal execution plans and to decide when pushing computation into the external data source is beneficial.
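As a sketch of that statement (the external table is the one used in this article's examples; the statistics name and column are illustrative):

-- Statistics on an external table, created just as on a regular table
CREATE STATISTICS Stat_Temperature ON SensorData_ExternalHDP (Temperature);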
While other SQL-over-Hadoop solutions (e.g. Impala, Stinger, and HAWQ) have improved, they still cannot match the query performance of a mature relational MPP system. With PolyBase, the user can import data into PDW in a very simple fashion (through a CTAS statement, see below), use the fast SQL Server column store technology along with the MPP architecture, or let the PDW/PolyBase query optimizer decide which parts of the query are executed in Hadoop and which parts in PDW. This optimized querying, called split-based query processing, allows parts of the query to be executed as Hadoop MR jobs that are generated on the fly, completely transparently to the end user. The PolyBase query optimizer takes into account parameters such as the spin-up time for MR jobs and the generated statistics to determine the optimal query plan.
In general, when it comes to performance, the answer usually is 'it depends on the actual use case/query'. With PolyBase, the user has total freedom and can leverage the capabilities of PDW and/or Hadoop based on their actual needs and application requirements.
PolyBase in APS bridging the gap between the relational world, Hadoop (external or internal) and Azure
The T-SQL statement below will run across all data sources, combining structured appliance data with un/semi-structured data in an external Hadoop cluster, an internal HDInsight region, and Azure (e.g. historical data):
T-SQL SELECT querying external Hadoop, HDInsight & PDW regions, and Azure
SELECT machine_name, machine.location
FROM Machine_Information_PDW, Old_SensorData_Azure, SensorData_HDI, SensorData_ExternalHDP
WHERE Machine_Information_PDW.MachineKey = Old_SensorData_Azure.MachineKey
  AND Machine_Information_PDW.MachineKey = SensorData_HDI.MachineKey
  AND Machine_Information_PDW.MachineKey = SensorData_ExternalHDP.MachineKey
  AND SensorData_HDI.Temperature > 80
  AND Old_SensorData_Azure.Temperature > 80
  AND SensorData_ExternalHDP.Temperature > 80
This query example shows how simplicity and performance are combined at the same time. It shows three external tables referring to three different locations plus one regular (distributed) PDW table. While executing the query, the PolyBase/PDW query engine will decide, based on the statistics, whether or not to push computation to the external data source (i.e. Hadoop).
Rewriting & Migrating existing applications
Finally, you may have heard that Hadoop is ‘cheaper’ than more mature MPP DBMS systems. However, what you might not have heard about is the cost associated with rewriting existing applications and ensuring continued tool support. This goes beyond simple demos showing that tool ‘xyz’ works on top of Hadoop/HDFS.
PolyBase does not require you to download and install different drivers. The beauty of our approach is that external tables appear like regular tables in your tool of choice; the information about the external data sources and file formats is abstracted away. Many Hadoop-only solutions are not fully ANSI SQL compliant and do not support various SQL constructs. With PolyBase, however, you don't need to rewrite your apps, because it uses T-SQL and preserves its semantics. This is specifically relevant for users coming from a 'non-Java/non-Hadoop world'. You can explore and visualize your data sets either by using the Microsoft BI solutions (initiated on-premises or through corresponding Azure services) or by using the visualization tool of your choice. PolyBase keeps the user experience the same.

3. Simplified ETL & Fast Insights
It's already a painful reality that many enterprises store and maintain data in different systems, each optimized for different workloads and applications. Admins spend much time moving, organizing, and keeping data in sync. This reality imposes another key challenge, which we address with PolyBase: in addition to querying data in external data sources, a user can achieve simpler and more performant ETL (extraction, transformation, loading). Unlike existing connector technologies, such as Sqoop, a PolyBase user can use T-SQL statements to either import data from external data sources (CTAS) or export data to external data sources (CETAS).
T-SQL CETAS statement to age out Hadoop & PDW data to Azure
CREATE EXTERNAL TABLE Old_Data_2008_Azure
WITH (LOCATION='//Sensor_Data/2008/sensordata.tbl', DATA_SOURCE=Azure_DS, FILE_FORMAT=DelimText2)
AS SELECT T1.*
FROM Machine_Information_PDW T1
JOIN SensorData_ExternalHDP T2 ON (T1.MachineKey = T2.MachineKey)
WHERE T2.YearMeasured = 2008
Combines data from external Hadoop and PDW sources and stores the results in Azure
Under the covers, the PolyBase query engine not only leverages the parallelism of an MPP system, it also pushes computation to the external data source to reduce the data volume that needs to be moved. The entire procedure remains totally transparent to the user while ensuring a very fast import and export of data that greatly outperforms any connector technology offered today. With the CTAS statement, a user can import data into the relational PDW region, storing it as a column store. This way, users can immediately leverage the column store technology in APS without any further action.
T-SQL CTAS statement for importing Hadoop data into PDW
CREATE TABLE Hot_Machines_2011
WITH (DISTRIBUTION = HASH(MachineKey), CLUSTERED COLUMNSTORE INDEX)
AS SELECT *
FROM SensorData_HDI
WHERE SensorData_HDI.YearMeasured = 2011
  AND SensorData_HDI.Temperature > 150
Combines PolyBase with column store – Imports data from Hadoop into PDW CCI tables
In summary, PolyBase is more than just another T-SQL front end over Hadoop. It has evolved into a key integrative component that allows users to query, in a simple fashion, data stored in heterogeneous data stores. There is no need to maintain separate import/export utilities. PolyBase ensures great performance by leveraging the computation power available in external data sources. Finally, the user has freedom in almost every dimension, whether it's tuning the system for the best performance, choosing the tools they prefer to derive valuable insights, or leveraging data assets stored both on-premises and within the Azure data platform.
Watch how APS seamlessly integrates data of all sizes and types here
Learn more about APS here
What does it take to become a database administrator, and what kinds of traits should I be looking for when I am hiring a DBA? Those traits can be summarized in two categories: technical and personal. In this article, Greg Larsen discusses the technical traits a DBA should have.
Integration with Microsoft's Azure and new business intelligence and Big Data tools are among the most striking features of SQL Server 2014, finds reviewer Paul Ferrill.
Virginia Tech is using the Microsoft Azure Cloud to create cloud-based tools to assist with medical breakthroughs via next-generation sequencing (NGS) analysis. NGS analysis requires both big computing and big data resources. A team of computer scientists at Virginia Tech is addressing this challenge by developing an on-demand, cloud-computing model using the Azure HDInsight Service. By moving to an on-demand cloud computing model, researchers will now have easier, more cost-effective access to DNA sequencing tools and resources, which could lead to even faster, more exciting advancements in medical research.
We caught up with Wu Feng, Professor in the Department of Computer Science and Department of Electrical & Computer Engineering and the Health Sciences at Virginia Tech, to discuss the benefits he is seeing with cloud computing.
Q: What is the main goal of your work?
We are working on accelerating our ability to use computing to assist in the discovery of medical breakthroughs, including the holy grail of "computing a cure" for cancer. While we are just one piece of a giant pipeline in this research, we seek to use computing to more rapidly understand where cancer starts in the DNA. If we could identify where and when mutations are occurring, it could provide an indication of which pathways may be responsible for the cancer and could, in turn, help identify targets to help cure the cancer. It's like finding a "needle in a haystack," but in this case we are searching through massive amounts of genomic data to try to find these "needles" and how they connect and relate to each other "within the haystack."
Q: What are some ways technology is helping you?
We want to enable the scientists, engineers, physicists and geneticists and equip them with tools so they can focus on their craft and not on the computing. There are many interesting computing and big data questions that we can help them with, along this journey of discovery.
Q: Why is cloud computing with Microsoft so important to you?
The cloud can accelerate discovery and innovation by computing answers faster, particularly when you don’t have bountiful computing resources at your disposal. It enables people to compute on data sets that they might not have otherwise tried because they didn’t have ready access to such resources.
For any institution, whether a company, government lab or university, the cost of creating or updating datacenter infrastructure, such as the building, the power and cooling, and the raised floors, just so a small group of people can use the resource, can outweigh the benefits. Having a cloud environment with Microsoft allows us to leverage the economies of scale to aggregate computational horsepower on demand and give users the ability to compute big data, while not having to incur the institutional overhead of personally housing, operating and maintaining such a facility.
Q: Do you see similar applications for businesses?
Just as the Internet leveled the playing field and served as a renaissance for small businesses, particularly those involved with e-commerce, so will the cloud. By commoditizing “big data” analytics in the cloud, small businesses will be able to intelligently mine data to extract insight with activities, such as supply-chain economics and personalized marketing and advertising.
Furthermore, quantitative analytic tools, such as Excel DataScope in the cloud, can enable financial advisors to accelerate data-driven decision-making via commoditized financial analytics and prediction. Specifically, Excel DataScope delivers data analytics, machine learning and information visualization to the Microsoft Azure Cloud.
In any case, just like in the life sciences, these financial entities have their own sources of data deluge. One example is trades and quotes (TAQ), where the amount of financial information is also increasing exponentially. Unfortunately, to make the analytics process on the TAQ data more tractable, the data is often triaged into summary format, which can potentially and inadvertently filter out critical data that should have been kept.
Q: Are you saving money or time or experiencing other benefits?
Back when we first thought of this approach, we wondered whether the cloud would even be a feasible solution. For example, with so much data to upload to the cloud, would the cost of transferring data from the client to the cloud outweigh the benefits of computing in the cloud? With our cloud-enabling of a popular genome analysis pipeline, combined with our synergistic co-design of the algorithms, software, and hardware in the genome analysis pipeline, we realized about a three-fold speed-up over the traditional client-based solution.
Q: What does the future look like?
There is big business in computing technology, whether it is explicit, as in the case of personal computers and laptops, or implicit, as in the case of smartphones, TVs or automobiles. Just look how far we have come over the past seven years with mobile devices. However, the real business isn’t in the devices themselves, it’s in the ecosystem and content that supports these devices: the electronic commerce that happens behind the scenes. In another five years, I foresee the same thing happening with cloud computing. It will become a democratized resource for the masses. It will get to the point where it will be just as easy to use storage in the cloud as it will be to flip a light switch; we won’t think twice about it. The future of computing and data lies in the cloud, and I’m excited to be there as it happens.
For more information about Azure HDInsight, check out the website and start a free trial today.
MySQL Fabric is a new open-source tool included in MySQL Utilities 1.4.
How do you ensure that the DBA doesn't, or can't, drop a table accidentally? Oracle has at least two ways to ensure that a table cannot be accidentally dropped but there are some limitations to those methods. David Fitzjarrell looks at those methods to see which one works for the DBA account.
SQL Maestro Group announces the release of PostgreSQL Maestro 14.5, a powerful Windows GUI solution for PostgreSQL database server administration and database development.
The new version is immediately available for download.

Top 10 new features:
There are also some other useful things. The full press release is available at the SQL Maestro Group website.
Version 3.2.1 of DBD::Pg, the Perl interface to Postgres, has just been released. For more information and to download please visit:
A survey of Big Data professionals indicates that, while confusion remains, the market sector is growing quickly.
Edgenet provides optimized product data for suppliers, retailers, and search engines. Used online and in stores, Edgenet solutions ensure that businesses and consumers can make purchasing and inventory decisions based on accurate product information. Last year, the company implemented an In-Memory OLTP solution built on SQL Server 2014, which has helped it continue to innovate and lead in its business.
We caught up with Michael Steineke, Vice President of IT at Edgenet, to discuss the benefits he has seen since Edgenet implemented SQL Server 2014.
Q: Can you give us a quick overview of what Edgenet does?
A: We develop software that helps retailers sell products in the home building and automotive industries. We work with both large and small dealers and provide software that helps determine and compare which products are in a local store.
We provide the specs, pictures, manuals, diagrams, and all the rest of the information that a customer would need to make an informed decision. We take all of this data, standardize it, and provide it to retailers and search engines.
With the major shift to online sales over the past handful of years, retailers need to have relevant and timely product information available so the customer can compare products and buy the best one for their needs.
In a single store, inventory is easy. In a chain where you have 1,000 or 5,000 stores, that gets very complicated. Our company is built on product data, and we need a powerful solution to manage it.
Q: What is your technology solution?
A: We are using In-Memory OLTP based on SQL Server 2014 to power our inventory search. This is where SQL Server 2014 comes in. Our applications make sure we have the right product listed, pricing and availability, and we couldn’t do it without In-Memory OLTP.
Q: What types of benefits have you seen since deployment?
A: SQL Server 2014 and OLTP have helped change our business. Our clients are happy as well. No matter what our customers need, we can do it with our solution. If a retailer wants to supply the data to us every 10 minutes, we can update every 10 minutes. It’s the way we like to do our business.
Q: Why did you choose to deploy SQL 2014 in your organization?
A: Working with Microsoft was a natural choice since we often are early adopters with new technologies. Our goal is to utilize new feature sets of new software as much as possible so we stay innovators in the field. That was the main reason we were so excited to deploy the In-Memory OLTP features with SQL Server 2014.
Q: What type of data are you managing?
A: Our inventory data isn’t extremely large, but there is a lot of volatility with it. We are talking about managing thousands of products across thousands of stores, with different pricing and availability for each store. There could be hundreds of millions of rows for just one retailer. Our big data implementation is around managing this volatility in the market, and we need a powerful back-end solution to help us handle all sorts of information.
Q: What are the advantages of In-Memory OLTP?
A: The biggest advantage we are getting is the ability to continually keep data up to date, so we always have real-time inventory and pricing. While we are updating, we can continue to use the same tables with little or no impact on performance. We were also able to consolidate a read database for the application (refreshed daily) and a database that consumed the updates from our customers into one In-Memory database.
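For readers who have not seen the feature, a minimal memory-optimized table in SQL Server 2014 looks roughly like the following (an illustrative sketch, not Edgenet's actual schema; the table, column, and constraint names are invented, and the database must already contain a memory-optimized filegroup):

-- Memory-optimized table; DURABILITY = SCHEMA_AND_DATA also persists rows to disk
CREATE TABLE StoreInventory (
    StoreId   INT   NOT NULL,
    ProductId INT   NOT NULL,
    Price     MONEY NOT NULL,
    InStock   INT   NOT NULL,
    CONSTRAINT PK_StoreInventory PRIMARY KEY
        NONCLUSTERED HASH (StoreId, ProductId) WITH (BUCKET_COUNT = 1048576)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

Reads and writes against such a table use ordinary T-SQL, which is what allows updates to flow in continuously while queries keep running against the same table.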
For more information about SQL Server 2014, check out the website and start a free trial today.
Former AOL database architect and independent database consultant John Schultz benchmarks basic MongoDB from MongoDB, Inc. and TokuMX, the Tokutek high-performance distribution of MongoDB, to see which one performs best at scale. Focusing on a number of important performance characteristics, such as storage I/O and database insertion rates, Schultz tracked the degradation in performance for both products as database size increased. Here are his results.