Hadoop Join: Map-Side Join Explained

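In this example we reuse the data files from the previous example. Reduce-side join exists because the map phase cannot see all the fields needed for the join: records that share the same key may be processed by different map tasks. Reduce-side join is inefficient, because the shuffle phase has to transfer a large amount of data.

Map-side join is an optimization for the following scenario: of the two tables to be joined, one is very large and the other is small enough to fit in memory. In that case we can replicate the small table so that every map task holds a copy in memory (for example in a hash table), and then scan only the large table. For each key/value record of the large table, we look up its key in the hash table; if a matching record exists, we join the two and emit the result.

To support replicating files, Hadoop provides the DistributedCache class. It is used as follows:

(1) The user calls the static method DistributedCache.addCacheFile() to specify the file to replicate. Its argument is the file's URI (for a file on HDFS, something like hdfs://namenode:port/home/XXX/file, where namenode and port stand for the NameNode's host and port). Before the job's tasks run, the framework takes this list of URIs and copies the corresponding files to the local disk of each TaskTracker.

(2) The user calls DistributedCache.getLocalCacheFiles() to obtain the local paths of the copied files, and reads them with the standard file I/O API.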
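The build-then-probe idea described above can be sketched in plain Java, without any Hadoop classes. This is only an illustration: the class name HashJoinSketch, its methods, and the comma-separated row layout are assumptions made for the sketch, not part of Hadoop or of the example job.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: joins a small table to a large one in memory.
final class HashJoinSketch {

    // Build phase: index the small table by its first comma-separated field.
    static Map<String, String> buildIndex(List<String> smallTable) {
        Map<String, String> index = new HashMap<>();
        for (String line : smallTable) {
            String[] parts = line.split(",", 2);
            if (parts.length == 2) {
                index.put(parts[0], parts[1]);
            }
        }
        return index;
    }

    // Probe phase: scan the large table once and emit a row for every key found in the index.
    static List<String> join(List<String> smallTable, List<String> largeTable) {
        Map<String, String> index = buildIndex(smallTable);
        List<String> joined = new ArrayList<>();
        for (String line : largeTable) {
            String[] parts = line.split(",", 2);
            if (parts.length < 2) {
                continue; // skip malformed rows
            }
            String match = index.get(parts[0]);
            if (match != null) {
                joined.add(parts[0] + "," + match + "," + parts[1]);
            }
        }
        return joined;
    }
}
```

In a real job the build phase runs once per map task, and the probe phase is what each map() call does for a single record of the large table. Rows of the large table with no match in the small table are dropped, so this is an inner join.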

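This example takes three arguments: the file to cache, the input directory of the large table, and the output directory. Suppose HDFS contains two directories, input and input2, where input2 holds user.csv and input holds order.csv. The job is then run as follows: hadoop jar xxx.jar JoinWithDistribute input2/user.csv input output.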

The full example follows. It is written against the old API:

 
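import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class JoinWithDistribute extends Configured implements Tool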
{ 
  
    public static class MapClass extends MapReduceBase  
        implements Mapper<LongWritable, Text, Text, Text> 
    { 
  
        // Caches the small table; here it holds the records from user.csv
        private Map<String, String> users = new HashMap<String, String>(); 
  
        private Text outKey = new Text(); 
  
        private Text outValue = new Text(); 
  
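        // Called once per task, before any call to map()
        @Override
        public void configure(JobConf job)
        {
            BufferedReader in = null;

            try
            {
                // Local paths of the files cached for the current job
                Path[] paths = DistributedCache.getLocalCacheFiles(job);
                if (paths == null)
                {
                    // Nothing was cached; leave the users map empty
                    return;
                }

                String user = null;
                String[] userInfo = null;

                for (Path path : paths)
                {
                    if (path.toString().contains("user.csv"))
                    {
                        in = new BufferedReader(new FileReader(path.toString()));
                        while (null != (user = in.readLine()))
                        {
                            // Split once: userId, then the rest of the line
                            userInfo = user.split(",", 2);
                            if (userInfo.length == 2)
                            {
                                // Cache the record, keyed by userId
                                users.put(userInfo[0], userInfo[1]);
                            }
                        }
                    }
                }
            }
            catch (IOException e)
            {
                e.printStackTrace();
            }
            finally
            {
                // in stays null if user.csv was never opened
                if (in != null)
                {
                    try
                    {
                        in.close();
                    }
                    catch (IOException e)
                    {
                        e.printStackTrace();
                    }
                }
            }
        }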
  
        public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
                Reporter reporter) throws IOException
        {
            // Take the userId of each record in the order file, look up the
            // user record with the same userId in the cache, then merge the
            // two records and emit the result.
            String[] order = value.toString().split(",");
            if (order.length < 2)
            {
                // Skip malformed order records
                return;
            }
            String user = users.get(order[0]);

            if (user != null)
            {
                outKey.set(user);
                outValue.set(order[1]);
                output.collect(outKey, outValue);
            }
        }
  
    } 
  
    public int run(String[] args) throws Exception 
    { 
        JobConf job = new JobConf(getConf(), JoinWithDistribute.class); 
  
        job.setJobName("JoinWithDistribute"); 
        job.setMapperClass(MapClass.class); 
        job.setNumReduceTasks(0); 
  
        job.setInputFormat(TextInputFormat.class); 
        job.setOutputFormat(TextOutputFormat.class); 
  
        job.setMapOutputKeyClass(Text.class); 
        job.setMapOutputValueClass(Text.class); 
         
        // The first argument is the path of the file to cache
        DistributedCache.addCacheFile(new Path(args[0]).toUri(), job); 
        FileInputFormat.setInputPaths(job, new Path(args[1])); 
        FileOutputFormat.setOutputPath(job, new Path(args[2])); 
  
        JobClient.runJob(job); 
  
        return 0; 
    } 
  
    public static void main(String[] args) throws Exception 
    { 
        int res = ToolRunner.run(new Configuration(), new JoinWithDistribute(), args); 
        System.exit(res); 
    } 
  
}
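
To make the shape of the output concrete, the string handling in configure() and map() can be traced in plain Java. This is only a sketch: the class JoinRecordTrace and its methods are hypothetical helpers, and any sample records fed to it are invented, since the layout of user.csv and order.csv is not spelled out here.

```java
// Hypothetical trace of the mapper's string handling; not part of the job itself.
final class JoinRecordTrace {

    // Mirrors configure(): the user line is split only once,
    // so the cached value keeps every field after the userId.
    static String[] splitUser(String userLine) {
        return userLine.split(",", 2);
    }

    // Mirrors map(): returns "key<TAB>value" as TextOutputFormat writes it by default,
    // or null when the order's userId does not match the user record.
    static String joinLine(String userLine, String orderLine) {
        String[] user = splitUser(userLine);
        String[] order = orderLine.split(",");
        if (user.length < 2 || order.length < 2 || !user[0].equals(order[0])) {
            return null;
        }
        return user[1] + "\t" + order[1];
    }
}
```

For an invented user line 1,Alice,Beijing and an invented order line 1,book,2, joinLine returns Alice,Beijing followed by a tab and book. The userId itself and any order fields after the second are not written to the output, which is worth keeping in mind when adapting the mapper to other data.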

Reposted from: https://blog.csdn.net/huashetianzu/article/details/7821674

Original article by 奋斗. If you repost it, please credit the source: https://blog.ytso.com/9806.html
