Wrangler User Guide

Quick Start on Wrangler

  1. Wrangler users are required to have an XSEDE User Portal (XUP) account, even for projects not allocated by XSEDE.
  2. Request an XSEDE Startup Allocation. XSEDE Startup Allocations are for one year and are typically granted quickly, within about a week. We encourage users who want to test out Wrangler or demonstrate the need for a larger allocation on the system to request an XSEDE Startup Allocation. Please visit the XSEDE Allocations page for more information on the process.
  3. Once the allocation is granted, log in to the Wrangler Data Portal using either your TACC or XSEDE credentials to manage the group of users who can use the allocation and to create reservations of blocks of time and/or high speed storage to process and analyze data.
  4. While working with data on Wrangler, users can share data and results with colleagues and collaborators.
  5. Once results have been published or reviewed, users can publish data for use by the general research community.

Please see the Architecture section below for technical details about the makeup of the computing and storage capabilities of the Wrangler system.

Introduction to Wrangler

The Wrangler Data Analysis and Storage system is designed for the needs of modern data researchers. Wrangler's unique architecture handles the volume, velocity, and variety that can make digital data research difficult on standard high performance systems. The system is designed around a 0.5 PB high speed flash storage system that can be used to handle data analysis and processing workflows not practical on other systems with slower spinning disks or significantly smaller internal SSD storage devices.

Wrangler is dynamically provisioned by users in different ways to handle the different data workflows, including databases (both relational database systems and the newer noSQL style databases), Hadoop/HDFS based workflows (including MapReduce and Spark), and more custom workflows leveraging a flash-based parallel file system. In addition to data analysis and processing, Wrangler supports the data preservation, sharing, and publication needs for many data projects.

With two 10 PB file systems, one at TACC and one at Indiana University, Wrangler presents users with both iRODS and Globus based data management systems that can be used separately or together to store data and results for the duration of a research project, share those results with collaborators and colleagues, and eventually publish the data into systems such as DataONE.

Wrangler Services

The Wrangler Data Analytics System provides several environments to support different workflows; these are described in the sections that follow.

Allocations

Computations on Wrangler will be constrained by three different components:

  • amount of Flash memory needed for a project
  • amount of RAM available on an individual node for computations
  • throughput of the CPUs in each node

We will allocate the analytics system based on a combination of all three of these. Each allocated node provides users with two Haswell CPUs, 128 GB of DDR4 memory, and 4 TB of flash storage. These will be allocated as a single Node Unit for an hour. Projects should calculate the number of Node Units needed by looking at these three components and request the maximum number of nodes needed to support their research. CPU cycles for shared dedicated services (e.g., databases) will not be charged; such allocations should be justified based on storage needs only.

For example, a project may need to work with its 40 TB dataset for 3 different months of the year with a code that needs 80 cores, each with 5 GB of memory, for adequate performance. Its need is then 4 nodes of processing and 10 nodes' worth of flash storage. Thus it will request 21,600 Node Hours, as sketched below.
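
One way to arrive at that figure (a sketch, assuming roughly 720 hours in a month):

processing nodes  = ceil(80 cores / 24 cores per node)        =  4 nodes
memory nodes      = ceil(80 cores x 5 GB / 128 GB per node)   =  4 nodes
storage nodes     = ceil(40 TB / 4 TB of flash per node)      = 10 nodes
nodes to request  = max(4, 4, 10)                             = 10 nodes
hours needed      = 3 months x ~720 hours per month           = 2160 hours
Node Hours        = 10 nodes x 2160 hours                     = 21,600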

The second portion of the allocation on Wrangler is for the long term disk storage. This storage is for holding the interim data needed by a project between their flash storage computation campaigns, for computations that can be carried out on the disk based storage, and for the long term housing, sharing, and exploration of data and results from the research. Users should request this as they do any other storage allocation. Users will have full read/write access to this storage, be it housed in databases, iRODS collections, or files in the Lustre file system for the full duration of the allocation. TACC reserves the right to remove write access to the storage once the allocation is completed.

Once a Wrangler allocation has been awarded, the Project PI or delegate must log into the Wrangler Data Portal to manage users on the allocation and select the services needed, e.g., storage, database instances, iRODS collection partitions, Hadoop reservations.

System Access

Wrangler is the first XSEDE system hosted at multiple sites. This configuration leads to some unique authentication mechanisms. Users accessing Wrangler via the XSEDE User Portal GSISSH mechanism will use their XSEDE credentials to log in. Access to both the Indiana University systems and the TACC systems via ssh and other protocols will use the user's TACC credentials.

SSH Access to Wrangler

Users may ssh to Wrangler directly at either TACC or XSEDE end points with respective credentials.

login1$ ssh -l TACC-username wrangler.tacc.utexas.edu

or

login1$ ssh -l XSEDE-username wrangler.iu.xsede.org

Direct Access via XSEDE Single Sign-On Hub and GSI-OpenSSH (gsissh)

Wrangler can also be accessed using the Grid Sign In mechanisms supported by XSEDE, using your XSEDE credentials and GSI certificates. The following commands authenticate using the XSEDE myproxy server and then connect to the GSISSH port 2222 on Wrangler at TACC:

localhost$ myproxy-logon -s myproxy.teragrid.org
localhost$ gsissh -p 2222 XSEDE-username@wrangler.tacc.xsede.org
Last login: Wed Jun 24 11:02:00 2015   
Wrangler LosF managed host  
Provisioned on 26-Jan-2015 at 17:43  
login1.wrangler(1)$ 

Alternately users can login to the XSEDE Single Sign-On Hub via their XSEDE credentials and then issue the gsissh command from that host:

localhost$ ssh XSEDE-username@login.xsede.org
[userid@gw69 ~]$ gsissh -p 2222 XSEDE-username@wrangler.tacc.xsede.org
Last login: Wed Jun 24 11:02:00 2015   
Wrangler LosF managed host  
Provisioned on 26-Jan-2015 at 17:43  
login1.wrangler(1)$

Users must use their XSEDE credentials to log in to the myproxy or XSEDE User Portal sessions. Please consult NCSA's detailed documentation on installing and using myproxy and gsissh, as well as the GSI-OpenSSH User's Guide, for more information.

Database Instances

Structured Query Language (SQL) database services, and the many related technologies grouped under the rubric "NoSQL", may be utilized in many different contexts: as part of data analysis workflows, for data collection and warehousing activities, and as part of more complex software stacks used to support persistent web or other data services. Wrangler is intended to support most application areas where database technologies may be useful, through either persistent database services or "ad hoc" database services that run as part of a compute node reservation and job workflow. This documentation primarily covers the provisioning and usage of a persistent database, though many of the details also apply to ad hoc databases.

Persistent Database Service Support and Network Access

Wrangler provisions persistent databases at both the Indiana and TACC sites. Users who intend to employ the flash storage component of Wrangler should select "TACC" in the Wrangler Data Portal.

Wrangler currently supports Postgres (with or without spatial extensions) and MySQL databases as persistent services which can be automatically provisioned through the Wrangler Data Portal. Other database technologies, including NoSQL software tools such as MongoDB, are also supported on a request basis; if you need a persistent database service which is not available through the Wrangler portal, please submit a help ticket and in most cases we will be able to assist you with provisioning such a service on Wrangler.

Once your database is provisioned, there are many mechanisms that will allow you to access this database, from simple command-line SQL utilities to ODBC plug-ins for graphical tools such as Excel and SAS, APIs for various compiled and interpreted programming languages, and web front-ends. We will not attempt to exhaustively document all the potential interfaces to these tools, but will provide the basic information required to configure most connection utilities along with examples for some common workflows including the use of command-line utilities, and encourage users to check the documentation for their intended client tools for more information on connecting to a remote database.

Wrangler provides MySQL support through the MariaDB variant of MySQL. MariaDB is binary-compatible with MySQL but has a richer feature set than traditional MySQL. See http://www.mariadb.org for more information on MariaDB-specific features.

Connecting to your Database

Wrangler will host instances of MySQL, Postgres/PostGIS, MongoDB, and other database technologies at both TACC and Indiana University. Databases will not support active transactional replication between the two sites, but replication of the database files will allow for fast service migration from one site to the other. Users will be responsible for their own database and its design, optimization and backups. The name of the database or schema will be defined in the Wrangler portal, and the username and password will be the same as those used to access other services at TACC and/or IU. Because sensitive information may be communicated over the database connection, SSL must be configured for all connections to the database.

Users with existing databases may upload to Wrangler as follows:

  1. Log in to the Wrangler Data Portal: http://portal.wrangler.tacc.utexas.edu.
  2. Set up the database instance, specify storage needs, and designate the Database Administrator.
  3. A URL is generated for the new instance. A confirmation email will be sent containing directions on how to connect to the database and configure the local client.

You will need a few connection parameters in order to connect to your persistent database, primarily the hostname, network port, and database or schema name. The hostname of the database server will be supplied to you via e-mail once your database is provisioned, and can also be viewed in the Wrangler portal. Persistent databases will always be configured to utilize the default ports - these are listed below for each of the supported technologies:

Postgres/PostGIS 5432
MySQL 3306
MongoDB 27017
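
For example, these parameters appear together in a client connection command. The following is a minimal sketch using the standard clients, where HOSTNAME and dbname are placeholders for the values supplied when your database is provisioned (SSL options are discussed in the next section):

mysql --ssl -h HOSTNAME -P 3306 -u TACC-username dbname -p
psql "sslmode=require host=HOSTNAME port=5432 user=TACC-username dbname=dbname"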

Secure Database Connections

Since you will use your TACC (or IU) username and password to access persistent database services on Wrangler, it is very important that you connect to the database using a Secure Socket Layer (SSL) network connection. All major database clients will have an option to enable and to configure SSL connections to the database server; you may need to consult the documentation for your client application, or if you are using the standard MariaDB/MySQL and Postgres clients, example connection commands are listed below.

MariaDB/MySQL

mysql -v --ssl --ssl-cipher=AES256-SHA -u TACC-username -h db1 dbname -p

Postgres

psql "sslmode=require host=db1.wrangler.tacc.utexas.edu  user=TACC-username dbname=dbname"

Database Administrators

Each PI will designate a Database Administrator (DBA) who will have full administrative privileges to manage the database and grant permissions for other users. All other users within the project will receive read-only access to the database, unless the DBA user grants them additional privileges.

By default all users on a project will have full read access to all tables, but only DBAs will have create/insert/alter/delete privileges. DBAs will also be responsible for enforcing any additional security implementations (e.g. restricting connection to the database from specific IP addresses or domains). TACC staff will be available to help new administrators work with the system.

For PostgreSQL, both a "public" and "restricted" schema are created under the project's database. The public schema has full privileges for all users associated with the project. The "restricted schema" implements the permissions described above. For additional information on how to change permissions for users, please consult the PostgreSQL documentation.

For MySQL/MariaDB, a single database is created, with the above permissions. The concepts of "schema" and "database" are one and the same. For additional information on how to change permissions for users, please consult the MariaDB documentation.
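
As an illustrative sketch of granting additional privileges (the table name "observations", the database name "projectdb", and the user name "collaborator" are hypothetical, and the PostgreSQL example assumes the restricted schema is named "restricted"):

-- PostgreSQL: allow a project member to insert into and update one table in the restricted schema
GRANT INSERT, UPDATE ON restricted.observations TO collaborator;

-- MariaDB/MySQL: grant the same privileges on a table in the project database
GRANT INSERT, UPDATE ON projectdb.observations TO 'collaborator'@'%';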

Database Administration Clients

There are many different clients for connecting to the databases supported on Wrangler. As a starting point for new users, we recommend familiarizing yourself with these standard clients for database administration.

PostGIS/PostgreSQL

PostGIS is an extension of PostgreSQL that provides support for the storage and analysis of spatial data (GIS - Geographic Information System). It provides two new data types (geometry and geography) which may represent points, lines or polygons; and a large number of functions that can be used to manipulate and query those spatial data types. For example, a characteristic query might ask whether or not a particular point falls within the confines of a given polygon or whether or not a line and a polygon intersect. Other manipulations might project a data set from one set of coordinates to another or calculate the area of a collection of polygons (e.g. counties). A tutorial is provided here: http://workshops.boundlessgeo.com/postgis-intro/

Any regular instance of PostgreSQL may be converted into a PostGIS instance by issuing the SQL command:

CREATE EXTENSION postgis;
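
As a brief sketch of the kind of query described above (the table and column names here are hypothetical), one might store points in a geometry column and test whether each point falls inside a polygon:

-- create a table with a geometry column and insert a point (longitude/latitude, SRID 4326)
CREATE TABLE sites (id serial PRIMARY KEY, geom geometry(Point, 4326));
INSERT INTO sites (geom) VALUES (ST_SetSRID(ST_MakePoint(-97.74, 30.28), 4326));

-- find the points that fall within a given polygon (e.g., a county boundary)
SELECT s.id
FROM sites s, counties c
WHERE c.name = 'Travis' AND ST_Contains(c.geom, s.geom);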

Service Reservations

Since Wrangler is designed for data analysis, many researchers will be working with large and/or persistent datasets. A notable new Wrangler feature is the ability to make system reservations: reserving all or part of the machine, with a specified number of nodes and accompanying data storage, for a specified date and duration. Users wishing to invoke a Hadoop cluster can request a Hadoop Queue reservation, and users wishing to run one or a series of jobs against a persistent data set can request a Normal Queue reservation. The advantage of making a system reservation is the ability to perform numerous operations or runs on the persistent data for a period of up to 30 days, which obviates the need for constant staging and re-staging of large datasets over the same period. Users are responsible for transferring their data off the reserved storage before the reservation ends. A service reservation also facilitates multiple users having access to the same persistent data.

We strongly encourage all users to make service reservations. A service reservation will allocate a fixed set of nodes and accompanying flash storage resources for an extended period of time up to 30 days. This will allow for staging of data into the high-performance storage tier and for this data to persist over the course of many jobs during the course of the reservation. Users can reserve nodes and flash storage for Hadoop jobs or other non-HDFS based workflows or applications.

Normal Queue Reservations

Users wishing to analyze persistent data for an extended period can make a reservation in the normal queue. Jobs run under these reservations are the same standard batch jobs a user would submit to the normal queue on Stampede. Like Stampede, Wrangler employs the SLURM job scheduler.

A reservation sets aside both the compute nodes and their associated flash storage for a project's sole use. Once the reservation begins, the storage for all of the nodes will be made available to all of the nodes in the reservation.

During this reservation, users will need to submit jobs to use the reserved nodes by adding the reservation information to the SLURM job script or command line. For example, to use the reservation made for the project named "big_project253" the SLURM jobs would be submitted with the following additional command line argument:

login1$ sbatch --reservation=big_project253 myscript
sbatch: Submitted batch job 65540

Alternately the "idev" command can be used with the same "-r" argument to get interactive access to a compute node. Jobs will run on the reserved nodes when they are available. If members of a project are using all of the reserved nodes, new jobs will queue until they can begin.

For workflows using the flash storage systems with a reservation, users will be responsible for importing the data from the persistent disk based /data file system.

Typically this can be done in the first (or a subsequent) script run against the reservation, or by using the "idev" command to get an interactive session and import the data by hand. Similarly, it is the responsibility of users to copy all data to be preserved from their jobs prior to the end of the reservation period. As the flash storage system is a limited commodity on Wrangler, data are not preserved between reservations. Thus we encourage users to make reservations that maximize their productivity by minimizing the need for frequent data migrations to and from this storage system.
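
For example, a job script run under a reservation might stage data in from the persistent /data file system and copy results back before the reservation ends. The following is a minimal sketch in which FLASH_JOB_DIRECTORY stands for the flash file system path associated with your reservation (as in the iRODS examples later in this guide) and "mydataset" and "results" are hypothetical names:

# stage input data from the persistent /data file system onto the flash storage
cp -r $DATA/mydataset FLASH_JOB_DIRECTORY/

# ... run the analysis against FLASH_JOB_DIRECTORY/mydataset ...

# copy results back to persistent storage before the reservation ends
cp -r FLASH_JOB_DIRECTORY/results $DATA/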

When planning a reservation, estimate the size of your job:

  • storage needs for data and analysis during your computations
  • computing/cores needed for analysis
  • memory needed for analysis

For each node reserved, users will have access to

  • 24 cores on a node,
  • the 128 GB of memory on that node,
  • and up to 4 TB of storage on the DSSD storage system.

Users should employ the debug queue or other mechanisms to estimate the memory footprint and cores needed to run their computations on a single dataset and scale that to find the total size needed. For some projects the duration of a reservation will correspond to the period the project needs for analysis of a dataset, which may be significantly longer than the pure processing time. As transferring data to and from different storage devices can often be a significant time sink, we encourage users to stage data in the system for longer periods of time than the individual computations will take, especially for more interactive or iterative workflows.

Once the scale and duration of a reservation are known, request the reservation via the Wrangler Data Portal and specify:

  • number of nodes required
  • starting and ending time period of the reservation.

During the reservation period, users can submit analysis jobs as SLURM batch scripts to the nodes under the reservations. SLURM is documented extensively in the Stampede User Guide.

  • Multiple users may submit jobs to the same reservation.
  • Multiple jobs will be queued until there are enough resources available in the reservation.

The reservation also includes access to the flash storage. The flash based file system will be available for users of the same project to access during the reservation period.

Prior to the end of the reservation period, the user will be responsible for migrating all essential data from the flash file system to the local file system. Data is subject to purge immediately after the end of the reservation.

Hadoop Queue Reservations

Users wishing to do analysis with a Hadoop cluster must also make reservations via the Wrangler Data Portal.

  1. Create a hadoop cluster reservation through the Wrangler Data Portal.

    • A user should specify at least one of the following: the number of nodes required, the total distributed data storage in HDFS, or the expected amount of data to be processed. This information will be used by the reservation system to determine the number of data nodes needed for the reservation.
    • Specify the reservation start and end time.
    • To enable additional users to access the Hadoop cluster, the user must also specify a project; all members of that project will be given access to the reserved Hadoop cluster.
  2. An email notification containing instructions on how to access the cluster will be sent to the requestor once the reservation has started.
  3. During the Hadoop cluster reservation period, all users in the project specified at the time of the reservation will have access to the Hadoop cluster. A user needs to submit a SLURM job to the reservation in order to access the Hadoop cluster and submit MapReduce jobs. The SLURM job can be an interactive job, a batch job, or a VNC session job. For example, a user can start an interactive session to submit Hadoop jobs to the cluster or to move data between HDFS and NFS. Once a Hadoop job has been submitted to the Hadoop cluster, it will be managed and scheduled by the YARN resource manager, which is part of the Hadoop cluster. The SLURM job session can be terminated once the user requires no further interaction with the Hadoop cluster; a Hadoop analysis job that has been submitted to the YARN scheduler will run to completion even if the SLURM job session expires.
  4. The default Hadoop cluster will include the Hadoop core, Spark, HDFS, MapReduce, Hadoop Streaming, and Mahout packages. Users can install and/or request additional packages to be made available with their Hadoop reservation.
  5. Prior to the end of the reservation period, the user will be responsible for migrating all essential data from HDFS to the local file system. The HDFS file system created during the reservation will not be preserved or extended beyond the reservation period.

View System Reservations with showres

Use TACC's showres utility to view all current Wrangler reservations:

login1$ showres -h
Usage: /usr/local/bin/showres [OPTION]
list current reservations in a clean format:
	-a : list reservation(s) for all users on system
	-u : list reservation(s) for a specific user
	-h : prints this message
login1$ showres -a
Reservation Name            State        Queue  #Nodes   Start Time     End time  Duration
hadoop+TG-CCR150011+817    ACTIVE       hadoop       2  09-14T14:05  10-14T14:05  30-00:00
dssd+TG-ASC150021+838      ACTIVE       normal       1  09-18T17:05  10-18T17:05  30-00:00
hadoop+WranglerTeam+896    ACTIVE       hadoop       4  09-29T15:45  10-08T15:45   9-00:00

Data Management with IRODS

Most data management tasks in Wrangler will require that your data be stored in iRODS. To get started with iRODS, log in to the Wrangler Data Portal and create a collection in iRODS; once the collection is created, you can begin loading data using an iRODS client or the iDrop web interface. Data will be checksummed as it is ingested, and the checksums will be saved for later comparison against checksums generated prior to transfer or at a later date, to ensure the fixity of your data over time. Audit logs of all iRODS activity are also collected, allowing for tracking of all operations on data stored in iRODS, who initiated the operations, and when they were performed. Additional data management functionality will be added over time, and custom policies can also be added to the iRODS system. Users with advanced data management needs or specific policies they are interested in employing are encouraged to contact the Wrangler team by submitting a ticket in the portal.

To create your iRODS collection, log in to the Wrangler portal and select the project with which the collection will be associated. In the project details page, you will see an "iRODS collections" table in the lower right corner of the page, as shown here:

Wrangler Data Portal

Click the "Create iRODS collection" button to start the process of setting up your iRODS collection. You will then be asked to give the collection a name. An iRODS "collection" is like a special version of a directory in a normal Unix file system - the collection corresponds to a path within the iRODS hierarchy, and this path can also be used to apply policies for data ingest, post-processing, auditing and so forth. Since the collection name will also be part of a path name used for accessing your data, we suggest you choose a short, descriptive name without spaces and other special characters that could cause problems when scripting or working with the path from the command line.

Create IRODS collection

You may also select whether this collection will require public web access when creating the collection in iRODS; if this checkbox is selected, the collection will be placed in a special location that can also be accessed by a public web server, meaning that all files and subdirectories you store in this collection will be immediately available on the open web. If you do create a public web collection, we suggest that you also create another collection without open web access enabled; this will allow you to upload data, verify that you wish to share it, and then move it from the private to the public collection. If you wish to share data with only a limited subset of collaborators, you do not need to select the "public web access" option - this option is only needed if you wish to share your data without any limitations on access.

Accessing iRODS on Wrangler

The most performant and functional mechanism for accessing iRODS is through the command-line utilities, known as the "i-commands". There is some general information on using the i-commands to perform various tasks in the TACC "iRODS User Guide" at https://portal.tacc.utexas.edu/software/irods. You can also utilize the iDrop web interface at https://web.wrangler.tacc.utexas.edu/idrop-web/ - there is additional documentation on using the iDrop web interface at https://portal.tacc.utexas.edu/tutorials/web-based-irods.

You can get access to the iRODS command line utilities on the Wrangler login and compute nodes by typing "module load irods". You will need to configure your environment for the Wrangler iRODS zone before attempting to connect; the simplest way to do this is to copy the example file into your home directory:

% mkdir ~/.irods
% cp /work/irods/irods_environment.json ~/.irods/irods_environment.json

Then open the file "~/.irods/irods_environment.json" in your favorite text editor and change "USERNAME" to your Wrangler username everywhere that it appears. Once this step is completed you can run "iinit" to log in to iRODS and begin managing your data. Your password will be the same as your TACC user portal and Wrangler login password.

Data Migration between iRODS and the Flash Storage

You can use the iRODS utilities to retrieve data into the Flash storage before running a job or from within a job, and you can also store your data in iRODS at the end of a job using the same utilities. For example, at the beginning of your job script, you could put the following commands to retrieve a full directory of data:

% cd FLASH_JOB_DIRECTORY
% iget -r /wranglerZ/home/USERNAME/JOB_DATA

At the end of a job or campaign, you can use a similar command to store data back to iRODS - the following command will copy a whole directory and all its contents from the Flash storage into the iRODS system.

% iput -r FLASH_JOB_DIRECTORY /wranglerZ/home/USERNAME/JOB_OUTPUT_DIRECTORY/

Checksum and auditing functions are automatically enforced by the iRODS rule engine, so you do not need to explicitly specify that you wish to store checksums, though it does no harm to do so.

Data Curation

Wrangler provides services and tools for users to perform general and science domain specific data management and curation tasks on their datasets across the lifecycle of their projects, from data ingest to analysis and publication stages.

Through the Wrangler Data Portal, users will be able to track the size, growth, authenticity and integrity, and file format composition of their datasets.

Curation with iRODS

Curation on Wrangler/iRODS: Users who have data stored within iRODS can create collections and add collaborators to their data projects in the Wrangler Data Portal. Within this space, the authenticity and integrity of the data are checked on a regular basis.

Using the icommands, the iRODS web interface, or the iDrop client, users can add annotations and metadata to their files. Such metadata is registered in the iCAT iRODS database.
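
For example, using the icommands, a simple attribute/value pair can be attached to a data object with the "imeta" utility and listed back (a sketch; the path, attribute, and value shown are hypothetical):

% imeta add -d /wranglerZ/home/USERNAME/JOB_DATA/sample1.dat instrument sequencerA
% imeta ls -d /wranglerZ/home/USERNAME/JOB_DATA/sample1.dat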

iRODS has a rule engine that allows data management functions to be customized and automated. Users can create scripts that make use of the iRODS rule engine for scheduling data operations, applying metadata, reorganizing data, and other automated functions. User-developed rules have to be implemented and tested on Wrangler's iRODS instance by our system administrators. Please submit a helpdesk ticket to consult with our team on how to create and implement iRODS rules.

Assigning Digital Object Identifiers (DOI)

Users who are ready to publish data stored in Wrangler can request a DOI by submitting a helpdesk ticket. Publishing data through the Wrangler Data Portal is especially useful for very large datasets that can be easily reused on Wrangler or other large-scale computational resources available through XSEDE.

Note that in order to be published, data should be complete and well described. Data published with a DOI will include a full DataCite metadata record, but users can add other descriptions, help files, and papers to the publication package. Once the data is published with a DOI, it cannot be changed, amended, or deleted. If the user wants to make a change to their data, the new version of the dataset can be published under a new DOI and the relationship to the first version will be recorded as metadata.

If users decide at some point that they want to move their published data to another repository, they can do so by submitting a ticket with the new location of their data and we will make changes in the system so that the same DOI points to the right target.

Data Curation Software

Wrangler will provide a number of data curation software packages allowing you to perform common tasks needed for data management and curation tasks. For an exhaustive list of all software supported by Wrangler, see the TACC Software page.

Computing Environment

Unix Shell

Users logging in to Wrangler directly (via SSH) are presented with a choice of login shells. The shell interprets the command line as well as statements in shell scripts. Wrangler's default shell is Bash. Users requiring a different login shell may submit a support ticket requesting this account modification. After your support ticket is closed, please allow several hours for the change to take effect.

Startup Scripts

Unix shells allow users to customize their environment via startup files containing scripts. Customizing your environment with startup scripts is not entirely trivial. Below are some simple instructions, as well as an explanation of the shell set up operations.

TACC Bash users should consult the Bash Users' Startup Files: Quick Start Guide document for instructions on how best to set up the user environment.

Technical Background

All UNIX systems set up a default environment and provide administrators and users with the ability to execute additional UNIX commands to alter that environment. These commands are sourced; that is, they are executed by the login shell, and the variables (both normal and environmental), as well as aliases and functions, are included in the present environment. Wrangler's hosts support the Bourne shell and its variants (/bin/sh, /bin/bash, /bin/zsh) and the C shell and its variants (/bin/csh, /bin/tcsh). Each shell's environment is controlled by system-wide and user startup files. TACC deploys system-specific startup files in the /etc/profile.d/ directory. User-owned startup files are dot files (they begin with a period and are viewed with the "ls -a" command) in the user's $HOME directory.

Each UNIX shell may be invoked in three different ways: as a login shell, as an interactive shell or as a non-interactive shell. The differences between a login and interactive shell are rather arcane. For our purposes, just be aware that each type of shell runs different startup scripts at different times depending on how it's invoked. Both login and interactive shells are shells in which the user interacts with the operating system via a terminal window. A user issues standard command-line instructions interactively. A non-interactive shell is launched by a script and does not interact with the user, for example, when a queued job script runs.

Bash shell users should understand that login shells, for example shells launched via ssh, source one and only one of the files ~/.bash_profile, ~/.bash_login, or ~/.profile (whichever the shell finds first, in that order), and will not automatically source ~/.bashrc. Interactive non-login shells, for example shells launched by typing "bash" on the command line, will source ~/.bashrc and nothing else.

TACC staff recommends that Bash shell users use ~/.profile rather than ~/.bash_profile or ~/.bash_login. Please see the Bash Users' Startup Files: Quick Start Guide.

You may also want to restrict yourself to POSIX-compliant syntax so both shells correctly interpret your commands.
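
Following these recommendations, a minimal ~/.profile might do nothing more than source ~/.bashrc so that interactive settings are also picked up by login shells (a sketch; see the Quick Start Guide for TACC's recommended contents):

# ~/.profile: read by Bash login shells
if [ -f ~/.bashrc ]; then
    . ~/.bashrc    # pick up aliases and other interactive settings
fi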

The system-wide startup scripts, /etc/profile for Bash and /etc/csh.cshrc for C-type shells, set system-wide settings such as ulimit and umask, and environment variables such as $HOSTNAME and the initial $PATH. They also source command scripts in the /etc/profile.d/ directory that site administrators may use to set up the environments for common user tools (e.g., vim, less) and system utilities (e.g., Modules, Globus).

Environment Variables

Another important component of a user's environment is the set of environment variables. Many of the UNIX commands and tools, such as the compilers, debuggers, profilers, editors, and just about all applications that have GUIs (Graphical User Interfaces), will inspect the user's environment for application-specific variables. To see the variables in your environment execute the command:

login1$ env

The variables are listed as keyword/value pairs separated by an equal (=) sign, as illustrated below by the $HOME and $PATH variables.

HOME=/home/utexas/tacc_username
PATH=/bin:/usr/bin:/usr/local/apps:/opt/intel/bin

Notice that the $PATH environment variable consists of a colon (:) separated list of directories. Variables set in the environment (with setenv for C-type shells and export for Bourne-type shells) are carried to the environment of shell scripts and new shell invocations, while normal shell variables (created with the set command) are useful only in the present shell. Only environment variables are displayed by the "env" (or "printenv") command. Execute "set" to see the (normal) shell variables.
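
For example (a brief sketch; PROJECT_DIR is a hypothetical variable name):

export PROJECT_DIR=$WORK/myproject    # Bourne-type shells: environment variable, inherited by child shells and scripts
setenv PROJECT_DIR $WORK/myproject    # C-type shells: the equivalent command
count=5                               # bash: a normal (unexported) shell variable; in csh use: set count=5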

Manage your Environment with Modules

TACC continually updates application packages, compilers, communications libraries, tools, and math libraries. To facilitate this task and to provide a uniform mechanism for accessing different revisions of software, TACC uses the modules utility.

At login, modules commands set up a basic environment for the default compilers, tools, and libraries. For example: the $PATH, $MANPATH, $LIBPATH environment variables, directory locations (e.g., $WORK, $HOME), aliases (e.g., cdw, cdh) and license paths are set by the login modules. Therefore, there is no need for you to set them or update them when updates are made to system and application software.

Users that require third party applications, special libraries, and tools for their projects can quickly tailor their environment with only the applications and tools they need. Using modules to define a specific application environment allows you to keep your environment free from the clutter of all the application environments you don't need.

The environment for executing each major TACC application can be set with a module command. The specifics are defined in a modulefile, which sets, unsets, appends to, or prepends to environment variables (e.g., $PATH, $LD_LIBRARY_PATH, $INCLUDE_PATH, $MANPATH) for the specific application. Each modulefile also sets functions or aliases for use with the application. You only need to invoke a single command to configure the application/programming environment properly. The general format of this command is:

module load modulename

where modulename is the name of the module to load. If you often need a specific application, see Controlling Modules Loaded at Login below for details.

Most of the package directories are in /opt/apps/ ($APPS) and are named after the package. In each package directory there are subdirectories that contain the specific versions of the package.

As an example, the fftw3 package requires several environment variables that point to its home, libraries, include files, and documentation. These can be set in your shell environment by loading the fftw3 module:

login1$ module load fftw3

To view a synopsis about using an application in the Modules environment (in this case, fftw3), or to see a list of currently loaded modules, execute the following commands:

login1$ module help fftw3
login1$ module list

Available Modules

TACC's module system is organized hierarchically to prevent users from loading software that will not function properly with the currently loaded compiler/MPI environment (configuration). Two methods exist for viewing the availability of modules: Looking at modules available with the currently loaded compiler/MPI, and looking at all of the modules installed on the system.

To see a list of modules available to the user with the current compiler/MPI configuration, users can execute the following command:

login1$ module avail

This will allow the user to see which software packages are available with the current compiler/MPI configuration (e.g., Intel 15 with MVAPICH2).

To see a list of modules available to the user with any compiler/MPI configuration, users can execute the following command:

login1$ module spider

This command will display all available packages on the system. To get specific information about a particular package, including the possible compiler/MPI configurations for that package, execute the following command:

login1$ module spider modulename

Software upgrades and adding modules

During upgrades, new module files are created to reflect the changes made to the environment variables. TACC will generally announce upgrades and module changes in advance.

Controlling Modules Loaded at Login

Each user's computing environment is initially loaded with a default set of modules. This module set may be customized at any time. During login startup, the following command is run:

login1$ module restore

This command loads the user's personal set of modules (if it exists) or the system default. If a user wishes to have their own personal collection of modules, they can create it by loading the modules they want, unloading the modules they don't, and then running:

login1$ module save

This marks the collection as their personal default collection of modules that they will have every time they login. It is also possible to have named collections, run "module help" for more details.

There is a second method for controlling the modules loaded at login. The ".modules" file is sourced by the startup scripts at TACC and is read after the "module restore" command. This file can contain any list of module commands required, as in the sketch below. You can also place module commands in shell scripts and batch scripts. We do not recommend putting module commands in personal startup files (.bashrc, .cshrc), however; doing so can cause subtle problems with your environment on compute nodes.
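
For example, a ~/.modules file might contain nothing more than the module commands you want at every login (a sketch; the modules shown are only examples):

# ~/.modules: read by the TACC startup scripts after "module restore"
module load fftw3
module load irods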

Interactive Access with the iDEV command

TACC's HPC staff have recently implemented the "idev" application on Wrangler. The idev utility provides interactive access to a single node and then spawns the resulting interactive environment to as many terminal sessions as needed for debugging purposes. idev is simple to use, bypassing the arcane syntax of the srun command. Further idev documentation can be found here: https://portal.tacc.utexas.edu/software/idev

In the sample session below, a user requests interactive access to a single node for 15 minutes in order to debug the "progindevelopment" application. idev returns a compute node login prompt:

login1$ idev -m 15
...  
--> Sleeping for 7 seconds...OK  
...  
--> Creating interactive terminal session (login) on master node c557-704.  
...  
c557-704$ vim myprog.c
c557-704$ make myprog

Now the user may open another window to run the newly-compiled application, while continuing to debug in the original terminal session:

WINDOW2 c557-704$ ibrun -np 24 ./progindevelopment
WINDOW2 ...program output ...  
WINDOW2 c557-704$

Use the "-h" switch to see more options:

login1$ idev -h

For further details, please consult the "idev User Guide".

Batch Jobs on Wrangler

Job Accounting

Jobs on Wrangler are charged differently than on other HPC resources. Wrangler introduces the metric of a NODE HOUR: the use of a node and its storage for a period of time. The user is charged for reserved nodes, whether those nodes are used for computation or not. Reservation cancellations will be refunded for the unused portions. Jobs submitted without reservations are charged for the duration of the job.

Wrangler Production Queues

The Wrangler production queues and their characteristics (wall-clock and processor limits; priority charge factor; and purpose) are listed in Table 1 below.

Currently the compute nodes of Wrangler can only be connected to 200 TB of Flash Storage at any time. With this limitation, we are currently operating Wrangler as two smaller clusters both attached to the same 200 TB of storage. Thus, there are two separate queues for processing using the DSSD. Your reservation will be made on one of the two, and you will need to submit jobs to the queue on which the reservation resides, either normal or hadoop.

In addition to the normal and hadoop queues, a smaller debug queue exists with four nodes at TACC for on-demand access to a Wrangler node or nodes for code development or debugging. These nodes cannot be reserved via the Portal and have a 4 hour maximum run time.

Table 1. Wrangler Production Queues at TACC

Queue Name   Max Runtime   # Nodes in Queue   SU Charge Rate       Purpose
normal       48 hrs        48                 1 SU per node hour   access the GPFS flash storage
hadoop       48 hrs        48                 1 SU per node hour   run hadoop jobs
debug        4 hrs         10                 1 SU per node hour   debugging code

Table 2. Wrangler Production Queues at IU

Queue Name   Max Runtime   # Nodes in Queue   SU Charge Rate       Purpose
all          5 days        23                 1 SU per node hour   production

SLURM Job scheduler

Batch facilities such as LoadLeveler, LSF, SGE and SLURM differ in their user interface as well as the implementation of the batch environment. Common to all, however, is the availability of tools and commands to perform the most important operations in batch processing: job submission, job monitoring, and job control (cancel, resource request modification, etc.).

The SLURM job scheduler is documented extensively in the Stampede User Guide. Please refer to the respective sections in that guide for information on SLURM commands to view queue status, submit and monitor jobs and more.

Parametric Launcher

The Parametric Launcher is a simple way to encapsulate many small tasks within a single SLURM job. The system starts as many individual tasks as it has capacity for and then continues to run subsequent tasks as others complete. The benefit is that one can easily run multiple tasks on a single node or suite of nodes and use the full CPU, memory, or I/O capabilities of those nodes in a single managed job, rather than having to orchestrate the distribution of tasks oneself.

Follow the steps below to submit a parametric job:

  1. Load the "launcher" module to set some default parameters:

     login1$ module load launcher
    
  2. This sets the $TACC_LAUNCHER_DIR environment variable that points to the launcher files. Copy the "launcher.slurm" script into your home directory or wherever you keep your SLURM scripts.

     login1$ ls $TACC_LAUNCHER_DIR
     hello      init_launcher  launcher.sge    paramlist  paramrun.sge    phiforward    README
     hello.f90  launcher       launcher.slurm  paramrun   paramrun.slurm  phiparamlist  tskserver
     login1$ cp $TACC_LAUNCHER_DIR/launcher.slurm .
  3. Edit the "launcher.slurm" script and make the following changes to customize for Wrangler:

     #------------------Scheduler Options--------------------
     #SBATCH -J Parametric          # Job name
     #SBATCH -N 1                   # Total number of nodes (24 cores/node)
     #SBATCH -n 24                  # Total number of tasks
     #SBATCH -p normal              # Queue name
     #SBATCH -o Parametric.o%j      # Name of stdout output file (%j expands to jobid)
     #SBATCH -t 01:00:00            # Run time (hh:mm:ss)
     ...
     ...
     #------------------General Options---------------------
     ...
     export TACC_LAUNCHER_PPN=24
    • Change the "-p" option to "normal" or "debug"
    • Set the "-t" option to the maximum time the job is allowed to run.
    • Set the "-N" and "-n" options to the number of nodes needed, and how many jobs a node can support, respectively. For most single threaded jobs "-n" should be set to 24 unless the jobs have other memory or IO constraints when running in parallel on a single node.
    • Change the $TACC_LAUNCHER_PPN value to 24
  4. Create and edit a new file, "paramlist", with each line being a single command line task to be run

     myprogram -i data1 -o output1 >& run1.out
     myprogram -i data2 -o output2 >& run2.out
     myprogram -i data3 -o output3 >& run3.out
     myprogram -i data4 -o output4 >& run4.out
     myprogram -i data5 -o output5 >& run5.out
     myprogram -i data6 -o output6 >& run6.out
  5. Submit the job:

    login1$ sbatch launcher.slurm

    The job will then run until all tasks have completed or the time specified by the "-t" parameter expires. If there are more tasks than the number of nodes times the number of tasks per node, the launcher will run as many as it can and, as individual tasks complete, start the next task in the list until all tasks have been run. It is therefore not necessary for the number of tasks in paramlist to equal the number of nodes times the number of tasks per node; in many cases of short tasks lasting minutes rather than hours, you can simply run on a small number of nodes for a long enough time to complete all the tasks.

Running Hadoop & HDFS Jobs on Wrangler

Once the Hadoop cluster has been started at the beginning of the project's reservation, each user who will interact with the Hadoop cluster will need to submit an individual job to access a node in the reservation. Typically this job will be a simple idev session (launched from the command line on the login node, which automatically logs the user onto a compute node), but it can be any other batch processing session (e.g., a SLURM batch job that does the data upload, computation, and/or data retrieval for the user in a non-interactive session). Each job needs to request only one node from the reservation pool for the purpose of interacting with the Hadoop cluster.

Typically the user can simply type "idev" to get a login session. The user must specify on the command line the reservation that has been made for the Hadoop cluster. For example:

login1$ idev -A myBigAllocation -r hadoopReservation2

will start an ssh session against the myBigAllocation project using the hadoopReservation2 reservation. Note that each user will be tied to a node of the Hadoop cluster, and only as many users as there are nodes in the Hadoop reservation will be able to log in to the system at a time. We encourage users to log in to a node to do their data ingest, job submission, and data retrieval, but not to stay logged in to a node during idle time, in order to free the session up for others to use.

Once a Hadoop application command has been submitted to the YARN scheduler of the Hadoop cluster, the user can log out of the idev session if no further interaction is needed. Applications remaining in the Hadoop cluster will run to completion independent of the SLURM job session.

Here is an example slurm job script file for submitting a Hadoop job.

#!/bin/bash
#SBATCH -J MyHadoop               # Job name
#SBATCH -o myjob.%j.out           # Name of stdout output file 
#SBATCH -p hadoop                 # Queue name
#SBATCH --reservation=myHadoopRes # Name of the reservation to be used 
#SBATCH -N 1                      # Total number of nodes requested (24 cores/node)
#SBATCH -n 1                      # Total number of MPI tasks requested
#SBATCH -t 01:30:00               # Run time (hh:mm:ss) - 1.5 hours
#SBATCH -A MyAllocation           # Allocation name to charge job against
hadoop jar MyJarFile.jar MainClass args

Common Hadoop Commands

Users can invoke hadoop commands directly from the node they are logged into.

hadoop command [generic_options] [command_options]

For example:

  • Run a Java application:

    login1$ hadoop jar MyJarFile.jar MainClass [args]
  • Run a Hadoop streaming job:

    login1$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
          -mapper mapper_cmd \
          -reducer reducer_cmd \
          -input input_path_HDFS \
          -output output_path_HDFS

In addition to running a ".jar" file directly with a Hadoop command, a user can also submit a job specified by a job script with the "mapred" command.

mapred [--config confdir] command

Examples:

  • To submit a job specified by MyMR.job, a user can run the following:

    login1$ mapred job -submit MyMR.job
  • List the currently running jobs in the Hadoop cluster:

    login1$ mapred job -list
  • To check the status of the job:

    login1$ mapred job -status jobid
  • To kill an active MapReduce job:

    login1$ mapred job -kill jobid

HDFS Commands

The default working directory under HDFS for each user is the user's HDFS home directory, "/user/username". Please contact us if the directory is not available.

A user can interact with HDFS using hdfs commands. Here are a few common commands:

  • List the current working directory contents:

    login1$ hdfs dfs -ls
  • Copy file "foo.test" from the local file system to HDFS and rename it to "foo_in_hdfs.test"

    login1$ hdfs dfs -copyFromLocal foo.test foo_in_hdfs.test
  • Copy file "foo_in_hdfs.test" from HDFS to the local file system and rename it to "foo_from_hdfs.test"

    login1$ hdfs dfs -copyToLocal foo_in_hdfs.test foo_from_hdfs.test
  • Remove file "foo.test" in the current HDFS working directory

    login1$ hdfs dfs -rm foo.test

Please see the Hadoop command line guide for further information.


Wrangler Architecture

The Wrangler Data and Analytics system comprises three primary subsystems to accommodate data research:

  1. A 10 PB disk based storage system
  2. A cluster of 96 Intel Haswell based analytics servers
  3. A 0.5 PB shared flash storage system able to support data I/O at unprecedented rates across the analytics system

Each analytics node has 24 cores and 128 GB of volatile memory with both InfiniBand FDR and 40 Gb/s Ethernet connectivity. The flash storage system, a product of DSSD, will be capable of supporting throughput to the analytics cluster at rates of 1 TB/s and transactions at a rate of more than 200 million IOPS. This primary system resides in Austin at TACC. In parallel to these systems, a replicated copy of the disk-based file system and a smaller 24-node analytics cluster reside at Indiana University. Both systems are connected to Internet2 via the TACC and IU 100 Gb/s links, giving Wrangler a maximum potential network throughput of 200 Gb/s for ingesting and accessing data.

Wrangler Data Analysis Environment

Supporting research on this hardware will be a software stack providing support for the transfer, access, curation, processing, analysis, and cataloging of data. Data transfer and management via Globus will allow users to transfer information in and out of the system at higher rates than standard single-threaded transfer mechanisms. Storage at both sites will be set aside for the delivery of data and for the extraction, loading, and transformation of data prior to processing and analysis. Data can then be stored either directly within the replicated Lustre file system allocated to house users' working data and results, within an iRODS instance for users wanting a higher level of data curation and sharing opportunities, or within the RDBMS and noSQL based systems hosted on Wrangler.

Two primary environments for analytics will be supported on Wrangler. Users will be able to submit jobs in a UNIX batch scripting environment familiar to many users of other HPC systems. Support for many commonly used data processing and analysis packages will be provided, including optimized versions of R and Python, as well as domain-specific applications and packages such as BLAST, astroPy, and spider. Support for common libraries providing parallelism in both computation and I/O will also be provided. Users will be able to leverage either or both storage systems, as well as databases, in their workflows.

In addition to the UNIX environment, Wrangler also provides users with a Hadoop environment. Built on the DSSD storage system, this somewhat unique environment will take advantage of both its speed and its all-to-all capability. Users will be able to allocate data nodes scaled to their processing needs rather than their storage needs. Support will also be provided for Pig and Mahout based jobs.

Wrangler Flash Storage

Access to the flash storage will be presented to the user in three different modes:

  1. A single common HDFS file system supporting Hadoop jobs
  2. A POSIX file system supporting workflows using POSIX file access
  3. An object store based on the flood API from DSSD. Details about the usage of this API will be provided for application developers.

Migration of data to and from the DSSD storage will be a key component of working with Wrangler. As this migration has the potential to add significant overhead to workloads using the storage, we intend to schedule usage of the Flash memory system in more of a campaign mode than one supporting individual jobs. Projects will be given a storage quota for a given period of time. These longer term reservations will be scheduled by TACC staff to ensure proper allocation of this resource.

Wrangler Lustre File System

The Lustre file systems at TACC and Indiana are two identical systems. Each houses 10 PB of 6 TB disks hosted in 35 Object Storage Servers (OSS). Dell MD3220 and MD1220 storage arrays housing 48 300 GB 15K RPM 6 Gbps SAS drives in a RAID-10 configuration at each site provide enough capacity to support more than 3 billion inodes at both TACC and IU. The Lustre file systems at each site will use 34 OSSes and will provide more than 90 GB/s of performance. The Lustre Metadata Servers (MDS) are hosted on two Dell R720 servers with dual Intel Xeon E5-2680 processors. The two MDS will be configured to act as an active/passive failover pair. All storage is connected to the analytics cluster via the 54 Gb/s Mellanox FDR fabric, with 120 lanes providing full all-to-all connectivity.

Analytics server details

The analytics servers are Dell R730 servers, each with two 12-core Intel Haswell E5-2680 v3 CPUs, 128 GB of 1600 MHz DDR4 memory, and one 146 GB local SAS hard drive for the local OS and software installation.

Each of the 96 nodes located at TACC will connect to three network fabrics: a switched PCI environment to connect to the high speed storage tier, an InfiniBand (IB) fabric to connect to the bulk storage tier, and a 40 Gbps Ethernet (40GigE) fabric to connect to public networks. This subsystem is sized to take maximum advantage of the available bandwidth in the high speed DSSD subsystem. Each node will use one of its PCI Gen3 x16 slots to hold a card supporting 12 GB/s of bandwidth and 10 PCI Gen3 x4 ports, one to connect to each DSSD chassis. The other PCI slot in these nodes will hold a dual-port Mellanox ConnectX-3 FDR IB and 40GigE card. The 24 nodes at IU will have similar IB and 40GigE connectivity.

The nodes in this subsystem will serve multiple roles. They will function as data movers to move data between the subsystems as well as to users and other systems. They will provide embedded analytics capabilities to run user jobs directly on these compute platforms without the need to migrate large datasets outside of Wrangler. Finally, they will act as data servers to external applications that make use of the datasets hosted within the system.

Table 3. Technical Specifications

Component            | Technology                        | Performance/Size
Compute Nodes (TACC) | dual 12-core Xeon E5-2680 v3 CPUs | 96 nodes / 2304 cores
Memory (TACC)        | distributed DDR4                  | 12.2 TB
Flash Storage        | DSSD PCI-attached NAND flash      | 0.5 PB (200 TB shared POSIX, 200 TB HDFS, 100 TB locally attached/object store)
Compute Nodes (IU)   | dual 12-core Xeon E5-2680 v3 CPUs | 24 nodes / 576 cores
Shared Disk          | Lustre parallel file system       | 10 PB replicated, 34 OSS, 153 OST
Interconnect         | Mellanox FDR InfiniBand switch    | 54 Gbit/s InfiniBand, 40 Gb/s Ethernet

File Systems & Storage

Wrangler differentiates between two types of storage: persistent and reserved. Persistent storage, familiar to users as the standard HOME, WORK, and SCRATCH file systems, each with its own quota, backup, and replication policies, is available for storing data between reservations on the system. Data on these file systems will not be purged for the duration of a project's allocation and, in the case of the project storage area, may be available for data retrieval even after the allocation has finished. These areas are all subject to quotas, giving users a reliable storage area that will not be filled up by other projects on Wrangler. Persistent storage is meant to hold files, applications, and other items for the entirety of a project allocation. It is not tied to reservations or job periods and leverages Wrangler's 10+ PB of disk-based storage at TACC and IU.

Reserved storage is temporary storage lasting for the duration of a job or reservation on the system. Reserved storage is intended for users to ingest data into, process, and extract results from, over the course of an individual job or project reservation. Data left in reserved storage is subject to immediate purge once the reservation has completed.

HOME directory

Each Wrangler user is granted their own HOME directory in the /home file system with a quota of 50 MB. The path to this directory is stored in the $HOME environment variable. This storage is local to the TACC and IU environments and is not replicated between the two sites. HOME is backed up by TACC and IU periodically. Use the HOME filesystem for configuration files, source code, etc.

WORK directory

At TACC the shared work file system, called Stockyard, is mounted on /work. The location of this directory is stored in the user's $WORK environment variable.

Each TACC resource isolates its own WORK file system on Stockyard. To access your WORK area for another system, use the $STOCKYARD variable followed by the name of that system: Wrangler's WORK area is $STOCKYARD/wrangler, while Maverick's is $STOCKYARD/maverick. For historical reasons, the WORK area for Stampede is not $STOCKYARD/stampede but simply $STOCKYARD. This space is not backed up by TACC or IU staff.
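
For example, following the convention just described (paths are illustrative):

login1$ cd $STOCKYARD/wrangler     # Wrangler's WORK area
login1$ ls $STOCKYARD/maverick     # WORK area used on Maverick
login1$ ls $STOCKYARD              # WORK area used on Stampede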

Each TACC user has access to 1 TB of storage across all systems at TACC. If a user has accounts on multiple TACC resources, e.g. Stampede and Maverick, then each account's WORK space usage counts toward the 1 TB quota.

DATA directory

Each user has an individual area on /data, pointed to by $DATA. This is a shared environment, and space is allocated via per-project quotas. It is intended for users to store their own data and/or interim results prior to integration with the overall project storage. This area is not backed up to archival media, but is replicated between the TACC and IU storage systems. To check both your individual and group quota usage on /data, use Lustre's "lfs" command:

login1$ lfs quota /data
Disk quotas for user ngaffney (uid 817221):
	Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
	    /data 41943132       0       0       -       3       0       0       -
Disk quotas for group G-814305 (gid 814305):
	Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
	    /data 41943144       0       0       -       6       0       0       -

Projects directory

This storage is intended to support data for the entire project. Projects will find their storage area in /data/projects/projectname. To find the name or number of your project or projects, use the "groups" command; each group available to a user corresponds to a project they belong to. To determine which project relates to which group, please consult the TACC User Portal. This storage is also part of the /data quota, and users should use the same "lfs quota /data" command to check their usage. Note that groups are responsible for managing their overall quota across all users' $DATA areas and their /data/projects area. This area is not backed up to archival media, but is replicated between the TACC and IU storage systems.
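
For example (the project directory name is a placeholder, and the group name is taken from the quota listing above):

login1$ groups                            # each group listed corresponds to a project
login1$ cd /data/projects/projectname     # replace "projectname" with your project's directory
login1$ lfs quota -g G-814305 /data       # usage counted against the group's /data quota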

iRODS storage

iRODS provides a managed data storage environment with additional functionality to help in the sharing, publishing, and preservation of data. With Wrangler, iRODS can be used to store data and have file sizes and checksums (also known as fixity information) cataloged and periodically verified to ensure data integrity. Because data are stored within the system, they are not directly available to standard application file interfaces (e.g. you cannot open a file with standard C or Python file open calls). Users can put files into their project's iRODS space, organize them in a familiar directory structure, search for files based on specific metadata cataloged for each file, and retrieve files from iRODS to a local file system. iRODS also provides a web-based interface for file ingest, management, and retrieval, as well as collaborative sharing features to share data with specific people or the public at large. All files stored in iRODS are replicated between TACC and IU.

Reserved Storage

Reserved storage is storage on the DSSD flash storage system at TACC. Because this storage is limited to 0.5 PB and is shared by projects, it is intended to be used only for the duration of a storage reservation. Note that hosted databases can also use this reserved storage if needed. All usage of flash storage is billed according to the amount of space set aside for the reservation or database schema, not the amount used. The rate is 1 SU per hour per 4 TB of allocated storage, or per compute node used, whichever is greater. For example, reserving 12 TB of flash for 48 hours is billed as 3 node units for 48 hours, or 144 SUs.

HDFS on Flash

This storage is set up at the start of a Hadoop cluster reservation and removed at its end.
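
A minimal sketch of staging data into and out of the reservation's HDFS space with the standard Hadoop file system commands (paths are placeholders):

login1$ hdfs dfs -mkdir -p /user/username/input
login1$ hdfs dfs -put $DATA/mydataset.csv /user/username/input/
login1$ hdfs dfs -ls /user/username/input
login1$ hdfs dfs -get /user/username/output/results.csv $DATA/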

GPFS on Flash

Project space will be allocated at the start of a reservation and deallocated at its end. Data left on the system after the end or termination of a reservation is subject to purge based on the needs of other projects using the system.
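
Because data on the flash file system is subject to purge once the reservation ends, a typical pattern is to stage input data onto the flash area at the start of the reservation and copy results back to persistent storage before it ends; the directory layout under /gpfs/flash shown below is a placeholder.

login1$ cp -r $DATA/mydataset /gpfs/flash/projectname/                # stage input onto flash
login1$ # ... run the analysis against /gpfs/flash/projectname ...
login1$ rsync -av /gpfs/flash/projectname/results/ $DATA/results/     # copy results back before purge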

Table 4. Wrangler File Systems at TACC

Filesystem                 | Type   | Quota                | Size                  | Replicated | Env var
Persistent: /home          | NFS    | 50 MB                | 6.3 TB                | no         | $HOME
Persistent: /work          | Lustre | 1 TB                 | 20 PB                 | no         | $WORK
Persistent: /data/projects | Lustre | allocation dependent | 10 PB                 | yes        | $DATA
Reserved: /hdfs            | HDFS   | -                    | reservation dependent | no         | -
Reserved: /gpfs/flash      | GPFS   | -                    | reservation dependent | no         | -

Table 5. Wrangler File Systems at IU

Filesystem                 | Type   | Quota                | Size   | Replicated | Env var
Persistent: /home          | -      | 50 MB                | 6.3 TB | no         | $HOME
Persistent: /data/projects | Lustre | allocation dependent | 10 PB  | yes        | $DATA

Sharing Files

Users often wish to collaborate with fellow project members by sharing files and data with each other. Project managers or delegates can create shared workspaces, areas that are private and accessible only to other project members, using UNIX group permissions and commands. Shared workspaces may be created as read-only or read-write, functioning as data repositories and providing a common work area to all project members. Please see Sharing Project Files on TACC Systems for step-by-step instructions.
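
A rough sketch of the UNIX group-permissions approach follows; the workspace path is a placeholder, the group name is the one shown in the quota example above, and mode 2750 can be used instead of 2770 for a read-only workspace.

login1$ mkdir /data/projects/projectname/shared
login1$ chgrp -R G-814305 /data/projects/projectname/shared
login1$ chmod 2770 /data/projects/projectname/shared    # setgid bit keeps new files owned by the project group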

Transferring Files to and from Wrangler

There are several transfer mechanisms for moving data to Wrangler, some of which depend on where and how the data are to be stored. Please review the following transfer mechanisms.

  • SSH utilities: the standard SSH file transfer mechanisms scp and rsync
  • Globus utilities: the Globus GUI and command-line utilities, which employ parallel data transfer mechanisms
  • Cyberduck: an SSH GUI utility for Mac and Windows users
  • iRODS: commands and a GUI to transfer data to and from the iRODS data repository on Wrangler
  • Data Dock: sending physical media (e.g. USB3 hard drives) to TACC to be uploaded to Wrangler

Transferring Using SSH Utilities: scp & rsync

The scp and rsync commands are standard data transfer mechanisms used to transfer moderate size files and data collections between systems. These applications use a single thread to transfer each file one at a time. The scp and rsync utilities are typically the best methods when transferring Gigabytes of data. For larger data transfers, parallel data transfer mechanisms, e.g., Globus, can often improve total throughput and reliability.

scp

Data transfer from any Linux system can be accomplished using the scp utility to copy data to and from the login node. A file can be copied from your local system to the remote server by using the command:

localhost% scp filename \ 
	TACC-username@wrangler.tacc.utexas.edu:/path/to/project/directory

Consult the man pages for more information on scp.

login1$ man scp

rsync

The rsync command is another way to keep your data up to date. In contrast to scp, rsync transfers only the changed parts of a file rather than the entire file, so this selective method of data transfer can be much more efficient than scp. The following example demonstrates using the rsync command to transfer a file named "myfile.c" from its current location on Stampede to Wrangler's $DATA directory.

login1$ rsync myfile.c \    
	TACC-username@wrangler.tacc.utexas.edu:/data/01698/TACC-username/data

An entire directory can be transferred from source to destination with rsync as well. For directory transfers the options "-avtr" will transfer the files recursively ("-r" option), along with the modification times ("-t" option), and in archive mode ("-a" option), which preserves symbolic links, devices, attributes, permissions, ownerships, etc. The "-v" option (verbose) increases the amount of information displayed during any transfer. The following example demonstrates the use of the "-avtr" options to transfer a directory named "gauss" from the present working directory on Stampede to a directory named "data" in Wrangler's $DATA file system.

login1$ rsync -avtr ./gauss \    
	TACC-username@wrangler.tacc.utexas.edu:/data/01698/TACC-username/data

For more rsync options and command details, run the command "rsync -h" or:

login1$ man rsync

When executing multiple instantiations of scp or rsync, please limit your transfers to no more than 2-3 processes at a time.

Transferring Using Cyberduck

TACC staff recommends the open-source Cyberduck utility for Mac and Windows users who do not already have a preferred tool.

Download Cyberduck here

Click on the "Open Connection" button in the top right corner of the Cyberduck window to open the connection configuration window shown below, choose the SFTP (SSH File Transfer Protocol) transfer mechanism, and type in the server name "wrangler.tacc.utexas.edu" or "wrangler.uits.iu.edu". Add your username and password in the spaces provided. If the "more options" area is not shown, click the small triangle or button to expand the window; this will allow you to enter the path to your project area so that when Cyberduck opens the connection you will immediately see your data. Then click the "Connect" button to open your connection.

Once connected, you can navigate through your remote file hierarchy using familiar graphical navigation techniques. You may also drag and drop files into and out of the Cyberduck window to transfer files to and from Wrangler.

Transfer to Wrangler using Cyberduck

Transferring using Globus utilities

Globus Connect and the Globus command-line utilities provide users with mechanisms for transferring files using the Globus transfer protocols. Users can create a Globus account, download the Globus Connect clients to install on their own systems, interact with the Globus Connect system, and learn about all of its features at the Globus site.

Globus globus-url-copy command

XSEDE users may also use Globus' globus-url-copy command-line utility to transfer data between XSEDE sites. globus-url-copy, like Globus Connect, is an implementation of the GridFTP protocol, providing high speed transport between GridFTP servers at XSEDE sites. The GridFTP servers mount the specific file systems of the target machine, thereby providing access to your files or directories. Users can also use their own personal Globus endpoints to transfer data to and from their own systems.

This command requires an XSEDE certificate to create a proxy for passwordless transfers. To obtain a proxy certificate, use the "myproxy-logon" command with your XSEDE User Portal (XUP) username and password. The proxy is valid for 12 hours for all logins on the local machine. On Wrangler, the myproxy-logon command is located in the CTSSV4 module (not loaded by default).

login1$ module load CTSSV4 
login1$ myproxy-logon -T -l XSEDE-username

Each globus-url-copy invocation must include the name of the server and a full path to the file. The general syntax looks like:

globus-url-copy [options] source_url destination_url

where each XSEDE URL will generally be formatted:

gsiftp://gridftp_server/path/to/file

Users may look up XSEDE GridFTP servers on the Data Transfers & Management page.

Note that globus-url-copy supports multiple protocols, e.g. HTTP and FTP, in addition to the GridFTP protocol. Please consult the Globus documentation and the examples below for more information.

globus-url-copy Examples

The following command copies "directory1" from TACC's Wrangler to Indiana University's Mason system, renaming it to "directory2". Note that when transferring directories, the directory path must end with a slash ( "/"):

login1$ globus-url-copy -r -vb \    
    gsiftp://gridftp.wrangler.tacc.xsede.org:2811/`pwd`/directory1/ \    
    gsiftp://mason.iu.xsede.org:2811/home/00000/johndoe/directory2/

Transferring Using GSI-OpenSSH

Additional command-line transfer utilities supporting standard SSH and grid authentication protocols are offered by Globus' GSI-OpenSSH implementation of OpenSSH. The gsissh, gsiscp, and gsisftp commands are analogous to the OpenSSH ssh, scp, and sftp commands. Grid authentication is provided to XSEDE users by first executing the myproxy-logon command (see above).

Users who need to transfer large amounts of data to Wrangler may find it worthwhile to disable gsiscp's default data stream encryption. To do so, add the following three options:

  • -oTcpRcvBufPoll=yes
  • -oNoneEnabled=yes
  • -oNoneSwitch=yes

to your command-line invocation. Note that not all machines support these options. You must explicitly connect to port 2222 on Wrangler. The following command copies "file1" on your local machine to Wrangler renaming it to "file2".

localhost$ gsiscp -oTcpRcvBufPoll=yes -oNoneEnabled=yes -oNoneSwitch=yes \    
	-P2222 file1 wrangler.tacc.xsede.org:file2

Please consult Globus' GSI-OpenSSH User's Guide for further info.

Transferring Using iRODS

There are several mechanisms for transferring data to and from Wrangler using iRODS. To access the iRODS command line utilities, please load the iRODS module.

login1$ module load irods

Please refer to the TACC iRODS software page for more information.
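
A minimal sketch using the standard iRODS icommands; the collection names are illustrative, and the exact zone and collection layout depend on how your project's iRODS space is configured.

login1$ iinit                             # authenticate to the iRODS zone
login1$ imkdir results                    # create a collection
login1$ iput -K myresults.tar results/    # upload a file and verify its checksum
login1$ ils -l results                    # list the collection contents
login1$ iget results/myresults.tar .      # retrieve a copy to the local file system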

Transferring Using Data Dock

While Wrangler is connected to Internet2 at both TACC and IU with a capacity of 100 Gb/s, some projects or users may find their data volume too great, or their local connectivity too slow, to transfer data to Wrangler effectively over the network. TACC staff can work with users to leverage Wrangler's Data Dock, which allows users to send physical media to TACC to be ingested onto the Wrangler system. For more information about the Data Dock or to set up transfer services, please file a ticket at the TACC User Portal or contact TACC data staff at data@tacc.utexas.edu.

Help

TACC, XSEDE, and Indiana University offer several means of user support for Wrangler.

Policies

TACC resources are deployed, configured, and operated to serve a large, diverse user community. It is important that all users are aware of and abide by TACC Usage Policies. Failure to do so may result in suspension or cancellation of the project and associated allocation and closure of all associated logins. Illegal transgressions will be addressed through UT and/or legal authorities.

References

*Wrangler is generously funded by the National Science Foundation (NSF) through award ACI-1447307, "Wrangler: A Transformational Data Intensive Resource for the Open Science Community".

Last update: June 21, 2016