HDInsight Test Drive

Apache on Windows Azure = HDInsight


Since Windows Azure HDInsight became generally available last month, I finally found some time this weekend to give it a test drive.

For those of you not familiar with Azure HDInsight, it is the Azure implementation of Apache Hadoop, licensed from Hortonworks, which provides Hadoop distributions on both Linux and Windows.

I started with this announcement about HDInsight generally available on Azure:

Windows Azure HDInsight is now Generally Available!

While this article seems to have some hands-on instructions on how to start an HDInsight cluster on Azure, it is more of a marketing piece (at least from a techie's perspective), so you should treat it as leisure reading and refrain from playing out the commands in it. In fact, some commands are so poorly formatted that I am sure you will get lost if you are new to HDInsight.

The following link is a better starting guide for HDInsight:

Get Started using Windows Azure HDInsight

Basically, you have to install two add-ons to make Windows PowerShell work with Azure and HDInsight before you can play with HDInsight. I had problems with the first one, Windows Azure PowerShell. For some reason, the command prompt in that shell doesn't display the cursor position, so it was very difficult for me to replace parameters in the sample commands. I did some searching online, and that problem seems to affect quite a few people. The suggested workaround is to switch to PowerShell ISE, and it worked for me.

That article also includes a step-by-step guide to provisioning an HDInsight cluster and submitting a word-count example (basically a "Hello World" of MapReduce) through PowerShell scripts and commands. At the end, it shows off the compatibility of the Windows toolset, letting tools such as Office Excel natively connect to an HDInsight cluster and perform BI analytics. I have to give Microsoft credit for that, given that Excel is such a popular, widely used BI client tool.
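For readers new to the MapReduce model, the word-count job is essentially its "Hello World". Here is a minimal Python sketch of what the mapper and reducer do conceptually; this is not the tutorial's code, and a real Hadoop job streams the key/value pairs between distributed tasks rather than running in one process:

```python
from collections import Counter

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word, as a streaming mapper would.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    # Reduce phase: sum the counts per word (Hadoop delivers pairs grouped by key).
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

print(reducer(mapper(["hello world", "hello HDInsight"])))
```

On a cluster the same two phases run in parallel over HDFS (or, in HDInsight's case, Azure Blob storage) splits, which is the whole point of the exercise.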

The next two articles focus on the storage behind HDInsight and on how to move data in and out of it. Unlike the Linux distribution of Hadoop, where you can pretty much interact with a remote Hadoop cluster using HDFS commands from your local computer, here you have to first remote-desktop onto the HDFS cluster node and run the commands there. I ran into a login issue with remote desktop even with the correct username and password, and the problem resolved itself the next day, so the HDInsight cluster still doesn't seem stable enough. If you are interested in Sqoop, you will want to read through the second link.

Use Windows Azure Blob storage with HDInsight

Upload data to HDInsight

Finally, the following two articles cover using Pig and Hive on HDInsight:

Using Pig with HDInsight

Use Hive with HDInsight

The examples are fairly easy to follow once you are comfortable with Azure PowerShell commands, assuming you have some background in how Apache Pig Latin and Hive work. However, in Azure PowerShell the only way to execute Hive and Pig is through batch files, so I truly miss the interactive shells of the Apache implementation, where I can play with the data at a prompt before putting the commands into a batch file.

Overall, Microsoft has finally brought a cloud version of Apache Hadoop to Windows Azure, though it still appears to be at an early stage and has quite a bit of catching up to do with Amazon Elastic MapReduce (Amazon EMR). However, if you are a Windows tech shop and already have data in Azure, this is probably a good starting point for planning big data analysis with HDInsight.

Final note: there is no way to suspend an HDInsight cluster even when you are not using it, so to avoid the billing, you are better off just releasing (i.e., deleting) the cluster after your MapReduce job finishes and re-provisioning another one next time.


November 16, 2013 · admin · One Comment
Posted in: Cloud Computing, Hadoop

Presentation About Big Data

Here is a recent talk that I gave at Southern California Code Camp.

In the talk I covered basic concepts in the world of big data and data science, with a focus on relational databases, NoSQL, MapReduce, machine learning, and data visualization, along with demos of MapReduce in action and Pig on Hadoop. The purpose of this presentation is to get you familiar with the terminology and concepts of data science, and to whet your appetite for further exploration into the world of big data. The presentation is adapted from a Coursera online course with a similar title and scope.

Download me


July 29, 2013 · admin · No Comments
Posted in: Cloud Computing, Database, Hadoop

7 Tips Using SQL Server Management Studio

If you are a database engineer, DBA, or developer working on data-centric applications, you probably use SQL Server Management Studio (SSMS) on a daily basis. While SSMS is very easy and intuitive to use, I have found the following 7 tips very useful; they can really boost your productivity.

1. Additional Connection Parameters in the Connect window

When making a new connection, you can specify additional parameters that add to, or override, the parameters under the main "Login" tab. A typical use case for me is connecting to a SQL Server in a different domain, such as a production or cloud domain, that requires Windows-only authentication under a security context different from my current login; the "User name" field is grayed out and can't be changed when "Windows Authentication" is selected in the "Login" tab.

SQL Server Connection Main Login

In this example, I need to use the domain login "york\install" instead of my current login "Hawk\Raymond", so in the "Additional Connection Parameters" tab you just need to specify User ID, Password, and Integrated Security. If you, like me, can't always remember the exact format of connection strings, you can easily look them up at www.connectionstrings.com
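For reference, the text typed into the "Additional Connection Parameters" tab would look something like the fragment below. This is an illustrative sketch only: the password is a placeholder, and the exact Integrated Security value to use depends on your setup, so check www.connectionstrings.com for the variants.

```
User ID=york\install;Password=********;Integrated Security=false
```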

Additional Connection Parameters

2. Script multiple objects with Object Explorer Detail window

The Script function in SSMS is very powerful, but when you need to script out multiple objects, such as tables, logins, or even SQL Server jobs, the Object Explorer (left side of the screenshot below) doesn't always let you select multiple objects with the usual Ctrl-click. In that case, press F7, or use the SSMS main menu "View", to bring up the Object Explorer Details window (right side), where you can use the familiar Ctrl or Shift keys to make a multiple selection and then right-click the selected area to bring up the scripting function. This comes in particularly handy during a server upgrade or migration, when you need to script out all server-level objects.

script function in Object Explorer Details

3. Generate data INSERT scripts from tables

Quite frequently you may have to script out all the data in one or more tables as INSERT statements to run in a different environment, for example for a deployment. You can still painfully hand-type a SELECT statement that builds the INSERT strings (SELECT 'INSERT INTO {table_name} VALUES (' + 'blah blah', you know what I am talking about), or install a third-party utility such as sp_generate_inserts, but SSMS already has this function built in, though it is not very obvious. In the Object Explorer window, right-click the database node, then Tasks -> Generate Scripts… to bring up the Generate and Publish Scripts Wizard.
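For the curious, the hand-typed approach boils down to concatenating each row into an INSERT string. A minimal Python sketch of the idea (the table and column names are made up; in T-SQL you would do the same with string concatenation in a SELECT):

```python
def to_insert(table, columns, rows):
    """Build one INSERT statement per row, doubling single quotes in string values."""
    stmts = []
    for row in rows:
        values = ", ".join(
            "'{}'".format(v.replace("'", "''")) if isinstance(v, str) else str(v)
            for v in row
        )
        stmts.append("INSERT INTO {} ({}) VALUES ({});".format(
            table, ", ".join(columns), values))
    return stmts

for stmt in to_insert("Person", ["Id", "Name"], [(1, "O'Brien"), (2, "Smith")]):
    print(stmt)
```

The quote-doubling is exactly the fiddly part that makes the hand-rolled version painful, which is why the built-in wizard below is worth knowing about.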

Generate Scripts Menu

After you select the desired table(s), click “Next” to take you to “Set Scripting Options” window:

Generate and Publish Scripts - Choose Objects

Here click the “Advanced” button to bring up “Advanced Scripting Options”. In “Types of data to script”, change the default option of “Schema only” to “Data only” or “Schema and data”, then complete the wizard. Voila, now you have the INSERT statements generated nicely in your specified target file.

Generate and Publish Scripts - Advanced Option

5. Template Explorer

SSMS comes with a set of commonly used script templates that you can drag from the Template Explorer (the right panel of the screenshot) and drop into a query window. For example, after locating the "Create Unique Nonclustered Index" template in the "Index" folder and dragging it into the query window (the middle panel), you just specify values for the parameters via the SSMS menu Query -> "Specify Values for Template Parameters…", and you have a nicely formatted script ready.

Note that you can further customize the templates by adding your own folders, and your own templates under each folder. I have used that to organize some very commonly used scripts for development and production operations. Just keep in mind that those template files are stored under your personal workspace path, such as C:\Users\Raymond\AppData\Roaming\Microsoft\SQL Server Management Studio\11.0\Templates\Sql, so you know where to retrieve the files in case you need to switch computers.

Template browser in SSMS

6. Use Code Snippets

This is similar to the Template Explorer just mentioned above, minus the drag-and-drop and parameter replacement. Anywhere in a query window you can right-click to bring up the context menu and select "Insert Snippets…", then follow the folder structure defined in the Snippet Manager to insert a code snippet.

Code snippets in SSMS

To bring up the Code Snippets Manager, just go to the SSMS menu Tools -> Code Snippets Manager, where you can further edit the snippet collection by adding or removing snippets. You can also tell from their location that code snippets are stored separately from query templates.

Code Snippets Manager

7. Shortkeys

You can quickly execute pre-defined scripts in the current query window without even typing them out. You define those scripts through the SSMS menu Tools -> "Options…": under "Environment" -> "Keyboard" -> "Query Shortcuts" you can assign scripts to a total of 12 shortkeys (Ctrl+0 to Ctrl+9, Ctrl+F1, and Alt+F1). I have found these shortkeys particularly handy in production monitoring, where you can easily summon a blocking-detection script or Adam Machanic's famous sp_WhoIsActive to quickly narrow down a problem, without typing or pasting scripts.

Shortkeys in SSMS


February 22, 2013 · admin · No Comments
Posted in: Database, SQL Server, Uncategorized

First Law of Microsoft Technologies

Recently I read the book "Architecting Microsoft .NET Solutions for the Enterprise" by Dino Esposito and Andrea Saltarello (Microsoft Press, 2008), where the authors summarize the below "First Law of Microsoft Technologies", which I couldn't agree more with:

First Law of Microsoft Technologies

Microsoft tends to supply a low-level framework and some higher-level tools such as controls, components, and designers. These tools are never designed to compete with analogous tools from third-party vendors. In addition, these tools target the average scenario and have the precise goal of doing a great job in most common situations. 

We’ve experimented with the “First Law of Microsoft Technologies” several times and in a variety of contexts—from ASP.NET controls and components to business logic models, from workflows to AJAX libraries, and including unit testing, code editing, code refactoring and, of course, Windows-based user interfaces. Keep this consideration in mind when you evaluate higher-level Microsoft tools and facilities. They’re excellent most of the time; they’re likely unfit for some situations. But you have to be a good judge of your needs to decide which option is exactly right for you! 

Some examples that I can think of to validate this First Law:

  • The Calendar control in ASP.NET forces an unnecessary postback. You can get around it, though, with the later ASP.NET AJAX technology (by placing the control in an UpdatePanel), but read on
  • ASP.NET AJAX was Microsoft's way to adopt AJAX, but it was later abandoned in favor of 3rd-party libraries like jQuery
  • Some ADO.NET wizards can launch a database-driven website with a very small amount of coding

As the authors pointed out, there is nothing wrong with some of those tools: they do a specific type of job very well and behave poorly beyond that. As an architect, you need to be very careful when trying to use those tools in enterprise-level applications; you have to be aware of what those tools cannot do.

To be fair, Microsoft has never been excellent at creating very original technologies, but it is good at adapting and improving. Understandably, you don't always get everything right in the first version. Microsoft is aware of that, and its product teams have been constantly reviewing the portfolio and making adjustments; it is quite common to see them quietly taking some technologies off the shelf. LINQ-to-SQL is a good example. As an architect, or just a professional working with Microsoft technologies, you need to look at where Microsoft is coming from when it promotes a new technology, and do your own homework before adopting it.

In the past few years, a lot of new technologies have started in the open source community, and it seems Microsoft is trying very hard to play catch-up. Big data/Hadoop is a great example. That sometimes leaves those of us on the Microsoft technology stack feeling that we are at Microsoft's mercy to bring a technology over and implement it right, and I don't know whether the gap will keep getting wider.

Let me end this article with a concrete example that testifies to this First Law. I have a simple .NET solution in Visual Studio 2012, the latest and greatest IDE, with a few projects in it:

  • ProductEntityModel – the code-first, Entity Framework 5 assembly that I reverse-engineered from the AdventureWorks sample database using Microsoft's Entity Framework Power Tools. By the way, that tool itself is buggy: it doesn't work with the SQL Server 2008 version of the AdventureWorks database because it can't recognize the "hierarchyid" data type, plus a few other annoyances, so I had to jump through several hoops to get the source files generated and compiled.
  • ProductService – a WCF project that I plan to deploy to my local IIS as a web service.

VS 2012 comes with Web Publishing Wizard so I decided to give it a try:

Web Publish Wizard

So I created a publish profile "Deploy to Local IIS", and then the wizard asked me for information about the web service. I stumbled a little bit before I could get the connection validated correctly. Notice that you have to manually type the text "Default Web Site" into "Site/application", and there is very limited information on this screen, so watch out, fat fingers :-)

Web Publish Wizard Step 2

The next screen asks for database connections and also shows a disabled "Execute Code First Migrations" checkbox, with a warning below that you will need to read some articles to configure it. Not very intuitive, but I can understand that.

Web Publish Wizard step 3

The next step is to preview your settings and files before publishing. After that, when I clicked "Publish", I got this error in Visual Studio, and nothing got published.

Web Publish Wizard Step 4

Alright, I get what the wizard was complaining about: I have to go to IIS Manager and assign a .NET 4.0 AppPool to the site. But the point here is that this wizard is very premature for any serious use. You typically go through something like this:

  • first, “Oh great, I can use this tool to simplify my work”
  • then, “Mmm, what does this step mean…?”
  • finally, "Wait, you mean now I have to do this and that in addition?…"

January 13, 2013 · admin · One Comment
Posted in: .NET, Architecture

Programming Lab

yutechnet labs

Recently I started exposing my lab projects, mostly programming ones for now, with some non-programming projects to come later, through my website. You can access them from "Labs" in the main menu of this website.

I do that for the following two main purposes:

1. To be more hands-on with the latest computing technologies. The technology world moves fast, and we all have to constantly keep ourselves up to date. In understanding a new technology, nothing beats reading some books, whitepapers, or articles and then immediately trying it in realistic projects, to get a true grasp of it and get close to the "bare metal" – the code. I don't always get the luxury of trying things in day-to-day software deliveries, so I decided to keep some lab projects going for that purpose. Also, having been in a managerial position for a while, I don't write a lot of code anymore, so I need those sideline projects to keep my coding skills from rusting. I have quite a long list to get through, such as Microsoft .NET Framework 4.5 and up, HTML5, JavaScript and CSS3, mobile, and big data/Hadoop, but that obviously takes time, so I will roll them out at my own pace. Anyway, that's fun, right?

2. Keep it real. I could just try some sample code snippets or scripts here and there on my development boxes, but I want to keep them in a central production place. To take it one step further, I will apply some formal software development principles when working on those projects:

  • Use git/GitHub for source control and future collaboration
  • Use FogBugz for project and release management, and bug tracking
  • Host in the cloud. I have a few EC2 instances with Amazon AWS that serve the lab purpose very well. By the way, in terms of cloud computing, I rate Amazon as the clear front-runner, ahead of Microsoft (Azure) and Google (App Engine/Compute Engine)

So, let’s go, and stay tuned.


January 5, 2013 · admin · No Comments
Posted in: .NET, Architecture, Cloud Computing, Google, Hadoop, mobile, Uncategorized